CN111159557B

CN111159557B - Hot spot information acquisition method, device, server and medium

Info

Publication number: CN111159557B
Application number: CN201911409021.8A
Authority: CN
Inventors: 唐颢诚; 姜文; 陆祁; 周寻; 孙斌
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2023-07-25
Anticipated expiration: 2039-12-31
Also published as: CN111159557A

Abstract

The embodiments of the invention provide a method, a device, a server and a medium for acquiring hot spot information, and relate to the technical field of information processing. The scheme of the application comprises the following steps. A first word segmentation result and word frequency information of each word segment are obtained based on a word segmentation operation on text content in a specified data source, and at least one main word is selected from the first word segmentation result. Then, for each main word, a second word segmentation result and word frequency information of each word segment are obtained based on a word segmentation operation on the text content containing the main word in the specified data source, and at least one auxiliary word associated with the main word is obtained from the second word segmentation result. Further, text content comprising the main word and at least one auxiliary word associated with the main word is obtained from the specified data source, and hot spot information corresponding to the main word is generated based on the obtained text content. By adopting the method, hot spot information can be acquired comprehensively and in real time.

Description

Hot spot information acquisition method, device, server and medium

Technical Field

The present invention relates to the field of information processing technologies, and in particular, to a method, an apparatus, a server, and a medium for obtaining hotspot information.

Background

As the number of users of various websites grows, the volume of user generated content (UGC) text produced on those websites, such as comments and bullet comments (barrages), has grown explosively. Mining hot events and hot words from this massive body of text is significant for content promotion and for understanding and guiding public opinion.

In the related art, operators manually find hot words and hot events in website text and collect the hot words and the texts related to them. However, operators have limited energy, so they cannot learn of all hot words and hot events on a website, and the information they collect lags behind to some degree. As a result, it is difficult to acquire hot spot information comprehensively and in real time.

Disclosure of Invention

The embodiment of the invention aims to provide a method, a device, a server and a medium for acquiring hot spot information, so as to realize real-time comprehensive acquisition of the hot spot information. The specific technical scheme is as follows:

in a first aspect, an embodiment of the present application provides a method for obtaining hotspot information, where the method is executed in a server, and includes:

based on word segmentation operation on text content in a specified data source, a first word segmentation result and word frequency information of each word segmentation are obtained, and at least one main word is selected from the first word segmentation result;

for each main word, performing word segmentation operation on text content containing the main word in the specified data source to obtain a second word segmentation result and word frequency information of each word segmentation, and acquiring at least one auxiliary word associated with the main word from the second word segmentation result;

and acquiring text content comprising the main word and at least one auxiliary word associated with the main word from the specified data source, and generating hot spot information corresponding to the main word based on the acquired text content.

In one possible implementation manner, the performing word segmentation on text content in a specified data source to obtain a first word segmentation result and word frequency information of each word segmentation, and selecting at least one main word from the first word segmentation result includes:

acquiring text content generated in the designated data source in a first historical time period, respectively generating text documents from the text content corresponding to each sub-time period in the first historical time period, and performing word segmentation on the text documents corresponding to each sub-time period to obtain a first word segmentation result corresponding to the text documents;

generating a main word candidate word set corresponding to the text document based on a first word segmentation result corresponding to the text document;

And selecting at least one main word from the main word candidate word set based on word frequency information of each word in the main word candidate word set.

In one possible implementation manner, the selecting at least one main word from the main word candidate word set based on word frequency information of each word in the main word candidate word set includes:

and calculating TF-IDF values of the words included in the main word candidate word set, and selecting a first preset number of words from the main word candidate word set as main words according to the sequence of the TF-IDF values from large to small.

In one possible implementation manner, the performing word segmentation operation on the text content including the main word in the specified data source based on each main word to obtain a second word segmentation result and word frequency information of each word segmentation, and obtaining at least one auxiliary word associated with the main word from the second word segmentation result includes:

for each main word, acquiring a text content set containing the main word generated in the appointed data source in a second historical time period before a sub-time period to which the main word belongs, and performing word segmentation on each text content in the text content set to obtain a second word segmentation result;

Generating an auxiliary word candidate word set corresponding to the main word based on the second word segmentation result;

and selecting at least one auxiliary word associated with the main word from the auxiliary word candidate word set based on word frequency information of each word in the auxiliary word candidate word set.

In one possible implementation manner, the selecting, based on word frequency information of each word in the set of auxiliary word candidates, at least one auxiliary word associated with the main word from the set of auxiliary word candidates includes:

determining the co-occurrence frequency of each auxiliary word candidate word included in the auxiliary word candidate word set and the main word in the text content set;

and taking the auxiliary word candidate words with the co-occurrence times larger than a preset time threshold value in the auxiliary word candidate word set as auxiliary words associated with the main word.

In one possible implementation manner, obtaining text content including the main word and at least one auxiliary word associated with the main word from the specified data source, generating hot spot information corresponding to the main word based on the obtained text content, including:

for each auxiliary word associated with the main word, acquiring text contents including the main word and the auxiliary word in the text content set, and forming the acquired text contents into a candidate auxiliary text set corresponding to the auxiliary word;

Performing de-duplication processing on text contents in the candidate auxiliary text set, and taking the text contents remained in the candidate auxiliary text set after the de-duplication processing as auxiliary text of the auxiliary word;

and generating hot spot information corresponding to the main word, wherein the hot spot information comprises the main word, auxiliary words associated with the main word and auxiliary text of each auxiliary word.
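The three steps above can be sketched as follows. This is only an illustrative sketch: the function name `build_hotspot_info`, the `(text, tokens)` pair layout, and the dictionary keys are assumptions made here, and the default de-duplication is a placeholder that removes exact duplicates only, not the cosine-similarity procedure of the following implementation manner.

```python
def build_hotspot_info(main_word, auxiliary_words, text_set, deduplicate=None):
    """Assemble hot spot information for one main word.

    text_set: list of (text, tokens) pairs for texts containing main_word
              (hypothetical layout chosen for this sketch).
    deduplicate: optional callable taking and returning a list of texts.
    """
    if deduplicate is None:
        # Placeholder assumption: drop exact duplicates, preserving order.
        deduplicate = lambda texts: list(dict.fromkeys(texts))

    auxiliary_texts = {}
    for aux in auxiliary_words:
        # Candidate auxiliary text set: texts containing both words.
        candidates = [
            text for text, tokens in text_set
            if main_word in tokens and aux in tokens
        ]
        # Texts remaining after de-duplication are the auxiliary text.
        auxiliary_texts[aux] = deduplicate(candidates)

    return {
        "main_word": main_word,
        "auxiliary_words": list(auxiliary_words),
        "auxiliary_texts": auxiliary_texts,
    }
```

A caller could pass a similarity-based routine as `deduplicate` in place of the exact-match placeholder.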

In a possible implementation manner, the performing a deduplication process on text content in the candidate auxiliary text set includes:

and aiming at each text content in the candidate auxiliary text set, calculating cosine similarity between the text content and each text content in a second preset number of text contents after the text content, and deleting the text content with the cosine similarity greater than a preset similarity threshold value.
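A minimal sketch of this windowed de-duplication is given below. Several details are assumptions made for illustration and are not stated in the text: texts are represented as bag-of-words token lists, the later text of a similar pair is the one deleted, the window counts positions in the original list, and the names `window` and `threshold` stand in for the second preset number and the preset similarity threshold.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words token lists."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(texts, window=3, threshold=0.8):
    """Return the indices of the texts kept after de-duplication.

    texts: list of token lists, in their original order.
    Each kept text is compared only with the `window` texts after it.
    """
    removed = set()
    for i in range(len(texts)):
        if i in removed:
            continue
        for j in range(i + 1, min(i + 1 + window, len(texts))):
            if j in removed:
                continue
            if cosine_similarity(texts[i], texts[j]) > threshold:
                # Assumption: the later text of the pair is deleted.
                removed.add(j)
    return [i for i in range(len(texts)) if i not in removed]
```

Limiting the comparison to a fixed window keeps the cost roughly linear in the number of texts, at the price of missing near-duplicates that lie further apart than the window.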

In one possible implementation manner, the generating the main word candidate word set corresponding to the text document based on the first word segmentation result corresponding to the text document includes:

filtering the segmented words included in the first segmented word result based on a preset filtering rule, and adding the segmented words which are not filtered into the main word candidate word set.

In one possible implementation manner, generating the auxiliary word candidate word set corresponding to the main word based on the second word segmentation result includes:

and filtering the segmented words included in the second segmented word result based on a preset filtering rule, and adding the segmented words which are not filtered into the auxiliary word candidate word set.

In one possible implementation, the first word segmentation result and the second word segmentation result each include a part of speech of each word;

the preset filtering rules comprise any one or more of the following filtering conditions:

if the part of speech of the word is the appointed part of speech, filtering the word;

if the word segmentation belongs to a preset keyword set, filtering the word segmentation;

and if the specific type character ratio included in the word segmentation is larger than the preset proportion, filtering the word segmentation.

In one possible implementation manner, the preset filtering rule further includes: if the word belongs to the white list word stock, the word is set to be in a state that the word cannot be filtered.

In one possible implementation, before the acquiring the text content generated in the specified data source within the first historical period, the method further includes:

when new text contents are generated in the appointed data source, word segmentation operation is carried out on the text contents aiming at each new text content generated in the appointed data source;

Correspondingly storing the segmentation words included in the text content and the text content identification of the text content into a first preset database;

correspondingly storing a text content identifier of the text content, the text content and the generation time of the text content in a second preset database;

the step of obtaining text content generated in the specified data source in the first historical time period, generating text documents from the text content corresponding to each sub-time period in the first historical time period, and performing word segmentation on the text documents corresponding to each sub-time period to obtain a first word segmentation result corresponding to the text documents, wherein the step of generating the text documents comprises the following steps:

acquiring text contents with the generation time belonging to the first historical time period from the second preset database;

respectively generating a text document from the acquired text contents, wherein the generation time of the text contents belongs to each sub-time period in the first historical time period;

aiming at the text document corresponding to each sub-time period, according to the text content identification of each piece of text content included in the text document, acquiring the word segmentation included in each piece of text content from the first preset database, and obtaining a first word segmentation result corresponding to the text document;

The step of obtaining, for each main word, a text content set containing the main word generated in the specified data source in a second historical time period before a sub-time period to which the main word belongs, and performing word segmentation processing on each piece of text content in the text content set to obtain a second word segmentation result, where the word segmentation result includes:

for each main word, acquiring text content which belongs to the second historical time period and contains the main word from the second preset database, and forming the acquired text content into the text content set;

and obtaining the segmentation included in each piece of text content from the second preset database according to the identification of each piece of text content included in the text content set, and obtaining the second segmentation result.
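The pre-segmentation and storage scheme above can be sketched with two in-memory mappings standing in for the two preset databases. The class name `HotspotStore`, its method names, and the use of plain dictionaries are assumptions for illustration; the text does not say which database technology is used, and the `segment` callable is a hypothetical stand-in for a real word segmenter. In this sketch the word segments are looked up by text content identifier in the token mapping.

```python
from datetime import datetime


class HotspotStore:
    """Illustrative stand-in for the first and second preset databases."""

    def __init__(self, segment):
        self._segment = segment  # hypothetical word segmentation function
        self.tokens_by_id = {}   # "first" store: text id -> word segments
        self.texts_by_id = {}    # "second" store: text id -> (text, created_at)

    def ingest(self, text_id, text, created_at):
        """Segment a newly generated text once and store both records."""
        self.tokens_by_id[text_id] = self._segment(text)
        self.texts_by_id[text_id] = (text, created_at)

    def ids_in_period(self, start, end, containing=None):
        """Ids of texts generated in [start, end), optionally containing a word."""
        return [
            text_id
            for text_id, (text, created_at) in self.texts_by_id.items()
            if start <= created_at < end
            and (containing is None or containing in text)
        ]

    def tokens_for(self, text_ids):
        """Look up stored word segments by text id instead of re-segmenting."""
        return [self.tokens_by_id[text_id] for text_id in text_ids]
```

Segmenting each text once at ingestion means that the later main word and auxiliary word steps only need lookups by identifier.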

In a second aspect, an embodiment of the present application provides a hotspot information obtaining apparatus, where the apparatus is applied to a server, and the apparatus includes:

the main word generation module is used for carrying out word segmentation operation on text content in a designated data source to obtain a first word segmentation result and word frequency information of each word segmentation, and selecting at least one main word from the first word segmentation result;

the auxiliary word generation module is used for carrying out word segmentation operation on the text content containing the main word in the appointed data source according to each main word to obtain a second word segmentation result and word frequency information of each word segmentation, and obtaining at least one auxiliary word associated with the main word from the second word segmentation result;

And the hot spot information generation module is used for acquiring text contents comprising the main word and at least one auxiliary word associated with the main word from the appointed data source, and generating hot spot information corresponding to the main word based on the acquired text contents.

In one possible implementation manner, the main word generating module is specifically configured to:

In one possible implementation manner, the auxiliary word generating module is specifically configured to:

In one possible implementation manner, the hotspot information generating module is specifically configured to:

In one possible implementation manner, the main word generating module is further configured to:

In one possible implementation manner, the auxiliary word generating module is further configured to:

In one possible implementation, the apparatus further includes:

the word segmentation module is used for carrying out word segmentation operation on the text content aiming at each new text content generated in the appointed data source when the new text content is generated in the appointed data source;

the storage module is used for storing the segmentation words included in the text content and the text content identification of the text content to a first preset database correspondingly, and storing the text content identification of the text content, the text content and the generation time of the text content to a second preset database correspondingly;

The main word generation module is further configured to:

the auxiliary word generation module is further configured to:

In a third aspect, an embodiment of the present application provides a server, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

A memory for storing a computer program;

and a processor, configured to implement the method steps described in the first aspect when executing the program stored in the memory.

In a fourth aspect, embodiments of the present application provide a machine-readable storage medium having stored thereon machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the method steps described in the first aspect.

In a fifth aspect, embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect described above.

With the hot spot information acquisition method and device provided by the embodiments of the present application, the server obtains a first word segmentation result and the word frequency information of each word segment by performing a word segmentation operation on text content in the specified data source, and selects at least one main word from the first word segmentation result. For each main word, the server then performs a word segmentation operation on the text content containing the main word in the specified data source to obtain a second word segmentation result and the word frequency information of each word segment, and obtains at least one auxiliary word associated with the main word from the second word segmentation result. Finally, the server acquires text content that includes the main word and at least one auxiliary word associated with the main word, and generates hot spot information corresponding to the main word based on the acquired text content. Compared with manual processing, the server can acquire text content in the specified data source in a more timely manner, and it determines main words and auxiliary words based on the word frequency information of the word segments included in the text content.

Of course, it is not necessary for any one product or method of practicing the invention to achieve all of the advantages set forth above at the same time.

Drawings

In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a flowchart of a method for obtaining hotspot information provided in an embodiment of the present application;

Fig. 2 is a flowchart of another method for obtaining hotspot information according to an embodiment of the present application;

Fig. 3 is a flowchart of another method for obtaining hotspot information according to an embodiment of the present application;

Fig. 4 is a flowchart of another method for obtaining hotspot information according to an embodiment of the present application;

Fig. 5 is a flowchart of another method for obtaining hotspot information according to an embodiment of the present application;

Fig. 6 is a schematic diagram of a method for obtaining hotspot information according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a hotspot information obtaining apparatus according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of another hotspot information obtaining apparatus according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

An embodiment of the present application provides a method for obtaining hotspot information, where the method is executed in a server, as shown in fig. 1, and the method includes:

S101, performing word segmentation operation on text content in a specified data source to obtain a first word segmentation result and word frequency information of each word segmentation, and selecting at least one main word from the first word segmentation result.

S102, aiming at each main word, performing word segmentation operation on text content containing the main word in a designated data source to obtain a second word segmentation result and word frequency information of each word segmentation, and acquiring at least one auxiliary word associated with the main word from the second word segmentation result.

S103, acquiring text content comprising a main word and at least one auxiliary word associated with the main word from a specified data source, and generating hot spot information corresponding to the main word based on the acquired text content.

With the hot spot information acquisition method provided by the embodiment of the present application, the server obtains a first word segmentation result and the word frequency information of each word segment by performing a word segmentation operation on text content in the specified data source, and selects at least one main word from the first word segmentation result. For each main word, the server then performs a word segmentation operation on the text content containing the main word in the specified data source to obtain a second word segmentation result and the word frequency information of each word segment, and obtains at least one auxiliary word associated with the main word from the second word segmentation result. Finally, the server acquires text content that includes the main word and at least one auxiliary word associated with the main word, and generates hot spot information corresponding to the main word based on the acquired text content. Compared with manual processing, the server can acquire text content in the specified data source in a more timely manner, and it determines main words and auxiliary words based on the word frequency information of the word segments included in the text content.

In S101 described above, the specified data source may be a website capable of generating text content, such as various types of video websites, social networking websites, or the like.

After the first word segmentation result and the word frequency information of each word segmentation are obtained, the server can select at least one main word from the first word segmentation result according to the word frequency information of each word segmentation. For example, a word segment with a higher word frequency is selected as the main word.

In S102, after the second word segmentation result is obtained, the server may select, from the second word segmentation result, the word segments with higher word frequency as auxiliary words according to the word frequency information of each word segment. It may be understood that the higher the word frequency of a word segment in the second word segmentation result, the higher the degree of association between that word segment and the main word, so word segments that are highly associated with the main word may be selected as auxiliary words of the main word based on this rule.

In one embodiment, as shown in fig. 2, S101, based on the word segmentation operation on the text content in the specified data source, obtains a first word segmentation result and word frequency information of each word segmentation, and selects at least one main word from the first word segmentation result, which may be specifically implemented as the following steps:

S1011, acquiring text contents generated in a designated data source in a first historical time period, respectively generating text documents from the text contents corresponding to each sub-time period in the first historical time period, and performing word segmentation on the text documents corresponding to each sub-time period to obtain a first word segmentation result corresponding to the text documents.

The first historical time period may be, for example, the last 15 days, and each sub-time period may be 4 to 8 hours.

Text content published by users on websites generally consists of comments. A single comment reflects only one point in time and cannot reflect the hot spots of a time period. The server can therefore acquire the text content of the most recent 15 days and, in order to calculate word frequency information, generate one text document from the text content of every 4 hours.

In addition, the word frequency information in the embodiments of the present application may be a term frequency-inverse document frequency (TF-IDF) value. Because a single comment is short and the TF-IDF value must be calculated over documents, the embodiments of the present application may segment by time period: the text content in each sub-time period is generated into one text document, and the text documents corresponding to the sub-time periods included in the first historical time period form a corpus.
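The time-period segmentation described above can be sketched as follows. The function name `build_sub_period_documents`, the `(created_at, text)` input layout, and the choice to index sub-time periods from the start of the window are assumptions for illustration; the 15-day and 4-hour defaults follow the example values in the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def build_sub_period_documents(comments, window_end,
                               history_days=15, sub_period_hours=4):
    """Group comments into one text document per sub-time period.

    comments: iterable of (created_at: datetime, text: str) pairs.
    Returns a dict mapping sub-period index -> list of texts; the list
    for one index plays the role of one text document in the corpus.
    """
    window_start = window_end - timedelta(days=history_days)
    sub_period = timedelta(hours=sub_period_hours)
    documents = defaultdict(list)
    for created_at, text in comments:
        # Keep only texts generated in the first historical time period.
        if not (window_start <= created_at < window_end):
            continue
        # Integer division of two timedeltas yields the bucket index.
        index = (created_at - window_start) // sub_period
        documents[index].append(text)
    return dict(documents)
```

With the default values the corpus contains at most 90 documents (15 days × 6 sub-time periods per day), which gives the IDF term a meaningful document count even though individual comments are short.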

S1012, generating a main word candidate word set corresponding to the text document based on the first word segmentation result corresponding to the text document.

Alternatively, the server may use all the word segments included in the first word segmentation result as main word candidate words, thereby generating the main word candidate word set.

Or, the server may filter the word segments included in the first word segment result based on a preset filtering rule, and add the unfiltered word segments to the main word candidate word set.

S1013, selecting at least one main word from the main word candidate word set based on word frequency information of each word in the main word candidate word set.

If the word frequency information is a TF-IDF value, this step may be specifically implemented as: calculating the TF-IDF value of each word included in the main word candidate word set, and selecting a first preset number of words from the main word candidate word set as main words in descending order of TF-IDF value.

The server may calculate the TF-IDF value of each word included in the main word candidate word set by the following formula:

TF-IDF = TF × IDF;

where TF is the term frequency of the word in the text document, and IDF is the inverse document frequency of the word over the corpus formed by the text documents.

The first preset number may be a preset value, or may be determined based on a preset proportion threshold. For example, the top 10% of the words in the main word candidate word set, in descending order of TF-IDF value, are selected as main words.

By adopting the method, the server can acquire text content generated in the appointed data source in the first historical time period, further generate a plurality of text documents based on the acquired text content, respectively generate a main word candidate word set based on the first word segmentation result of each text document, and select at least one main word based on word frequency information of each word in the main word candidate word set.
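A minimal sketch of the TF-IDF based selection is given below. The exact TF and IDF definitions used in the embodiment are not recoverable from this text, so the sketch assumes a common textbook form: TF is the count of a word in a document divided by the document length, and IDF is the natural logarithm of the number of documents divided by the number of documents containing the word. The function name `select_main_words` and the alphabetical tie-break are also assumptions.

```python
import math
from collections import Counter


def select_main_words(documents, top_n=2):
    """Select the top_n words per document by TF-IDF.

    documents: dict mapping sub-period id -> list of word segments
               (already segmented and filtered).
    Returns a dict mapping sub-period id -> list of main words.
    """
    doc_count = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    result = {}
    for doc_id, tokens in documents.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores = {
            # Assumed definitions: TF = count / length, IDF = ln(N / df).
            word: (count / total) * math.log(doc_count / doc_freq[word])
            for word, count in counts.items()
        }
        # Descending TF-IDF; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[doc_id] = ranked[:top_n]
    return result
```

A word that appears in every sub-time period gets an IDF of zero under this definition, so it is ranked below words that are concentrated in one period, which matches the goal of surfacing words that are hot in a particular period.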

In one embodiment, based on the method flow shown in fig. 2, as shown in fig. 3, S102, for each main word, performs word segmentation operation on text content including the main word in the specified data source to obtain a second word segmentation result and word frequency information of each word segmentation, and obtains at least one auxiliary word associated with the main word from the second word segmentation result, which may be specifically implemented as the following steps:

S1021, for each main word, acquiring a text content set containing the main word generated in a designated data source in a second historical time period before a sub-time period to which the main word belongs, and performing word segmentation on each text content in the text content set to obtain a second word segmentation result.

For example, the second historical time period may be the 15 days before the sub-time period to which the main word belongs. That is, for each main word, the server may obtain the text content containing the main word that users published on the website during those 15 days, and form that text content into a text content set.

S1022, generating an auxiliary word candidate word set corresponding to the main word based on the second word segmentation result.

Alternatively, the server may use all the word segments included in the second word segmentation result as auxiliary word candidate words, thereby generating the auxiliary word candidate word set.

Or, the server may filter the word segments included in the second word segment result based on a preset filtering rule, and add the word segments that are not filtered to the set of auxiliary word candidate words.

S1023, selecting at least one auxiliary word associated with the main word from the auxiliary word candidate word set based on word frequency information of each word in the auxiliary word candidate word set.

The word frequency information can be specifically the co-occurrence frequency of the auxiliary word candidate word and the main word in the text content set, and the step can be specifically realized as follows: and determining the co-occurrence frequency of each auxiliary word candidate word included in the auxiliary word candidate word set and the main word in the text content set, and taking the auxiliary word candidate word with the co-occurrence frequency greater than a preset frequency threshold value in the auxiliary word candidate word set as an auxiliary word associated with the main word.

Wherein, the co-occurrence of the auxiliary word candidate word and the main word means that the auxiliary word candidate word and the main word are simultaneously included in a piece of text content.

For example, if 1000 pieces of text content are included in the text content set, and the main word a and one of the auxiliary word candidate words B corresponding to the main word a co-occur in 300 pieces of text content in the text content set, it may be determined that the co-occurrence number corresponding to the auxiliary word candidate word B is 300.

It may be appreciated that, through this step, the server may determine the number of co-occurrences of each of the auxiliary word candidates corresponding to one main word, for example, if the auxiliary word candidate set of the main word a includes 500 auxiliary word candidates, the server determines the number of co-occurrences of each of the 500 auxiliary word candidates, respectively.

As an example, the preset number of times threshold is 500, that is, the auxiliary word candidate word having the number of co-occurrences greater than 500 is used as the auxiliary word of the main word, and of course, the value of the preset number of times threshold is not limited thereto and may be set according to actual situations.

With this method, the server can determine the auxiliary words corresponding to each main word. Because an auxiliary word is a word that co-occurs with the main word many times in the text content set, the auxiliary word can also serve as a hot word, and the main word and its auxiliary words are later used together to determine the hot spot information.
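The co-occurrence counting and thresholding of S1023 can be sketched as follows. The function name `select_auxiliary_words` and the parameter name `min_cooccurrence` (standing in for the preset number-of-times threshold) are assumptions for illustration. Consistent with the definition above, a piece of text content counts once per candidate word no matter how many times the word appears in it.

```python
from collections import Counter


def select_auxiliary_words(main_word, segmented_texts, min_cooccurrence):
    """Return {auxiliary word: co-occurrence count} for one main word.

    segmented_texts: list of word-segment lists, one per piece of text
                     content in the text content set.
    """
    cooccurrence = Counter()
    for tokens in segmented_texts:
        unique = set(tokens)
        if main_word not in unique:
            continue  # no co-occurrence possible in this text
        unique.discard(main_word)
        # Each text contributes at most 1 to each candidate's count.
        cooccurrence.update(unique)
    # Keep candidates whose count is strictly greater than the threshold.
    return {
        word: count
        for word, count in cooccurrence.items()
        if count > min_cooccurrence
    }
```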

Optionally, the first word segmentation result and the second word segmentation result each include part of speech of each word, and the filtering rules involved in S1012 and S1022 include any one or more of the following filtering conditions:

The specified part of speech is a part of speech which is preconfigured and cannot be used as a hot word, such as a conjunctive word, an exclamation word, an azimuth word, a time word, a main word, a mood word and the like.

The preset keyword set may also be called a preset stop word set, where words having no specific meaning, such as "on", "one day", and the like, are included in the preset keyword set.

The specified type of characters may be English letters and digits. If English letters and digits together make up more than a preset proportion of the characters in a word, the word is filtered out. Optionally, the preset proportion may be 50%.

For example, if a term is "0.00001%", the term may be filtered.

By adopting the method, filtering the first word segmentation result and the second word segmentation result prevents word segments that are meaningless for the hot spot information from affecting the subsequent flow, and also reduces the amount of calculation.

The names of celebrity groups or of individual celebrities can serve as hot words, but some of these names consist of English letters or of words without a specific meaning. A white list word stock can therefore be set to prevent words that can serve as hot words from being filtered out. The preset filtering rules may also include: if a word belongs to the white list word stock, the word is set to a non-filterable state.

It can be understood that a word in the non-filterable state is not filtered out even if it meets the three filtering conditions above. This prevents words that may become hot words from being filtered out, so that the hot spot information acquired later is more accurate.
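The filtering conditions and the white list override can be sketched as a single predicate. This is an illustrative sketch only: the function name `should_filter`, the part-of-speech tag strings, and the sample stop word and white list sets are hypothetical, and the 50% proportion is the optional value mentioned in the text.

```python
def should_filter(word, pos, banned_pos, stop_words, whitelist, max_ratio=0.5):
    """Return True if the word segment should be removed from the candidates."""
    # Words in the white list are in a non-filterable state.
    if word in whitelist:
        return False
    # Condition 1: the part of speech is a specified part of speech.
    if pos in banned_pos:
        return True
    # Condition 2: the word belongs to the preset stop word set.
    if word in stop_words:
        return True
    # Condition 3: English letters and digits exceed the preset proportion.
    alnum = sum(1 for ch in word if ch.isascii() and ch.isalnum())
    return bool(word) and alnum / len(word) > max_ratio


# Hypothetical configuration; the tag names are placeholders, not a real tag set.
BANNED_POS = {"c", "e", "f", "t", "y"}
STOP_WORDS = {"on", "one day"}
WHITELIST = {"EXO"}

filtered = should_filter("0.00001%", "m", BANNED_POS, STOP_WORDS, WHITELIST)
protected = should_filter("EXO", "n", BANNED_POS, STOP_WORDS, WHITELIST)
```

For "0.00001%", six of the eight characters are digits, so the proportion is 75% and the word is filtered; "EXO" consists entirely of English letters but is kept because it is on the white list.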

In one embodiment, based on the method flow shown in fig. 3, as shown in fig. 4, step S103, obtaining text content including a main word and at least one auxiliary word associated with the main word from a specified data source and generating hot spot information corresponding to the main word based on the obtained text content, may be specifically implemented as the following steps:

S1031, for each auxiliary word associated with the main word, acquiring the text content in the text content set that includes both the main word and the auxiliary word, and forming the acquired text content into a candidate auxiliary text set corresponding to the auxiliary word.

For example, if the main word and the auxiliary word co-occur 300 times in the text content set, 300 pieces of text content including the main word and the auxiliary word in the text content set may be obtained, and the 300 pieces of text content may be formed into a candidate auxiliary text set corresponding to the auxiliary word.

S1032, performing de-duplication processing on the text content in the candidate auxiliary text set, and taking the rest text content in the candidate auxiliary text set after the de-duplication processing as auxiliary text of the auxiliary word.

The de-duplication processing is as follows: for each piece of text content in the candidate auxiliary text set, calculate the cosine similarity between that piece and each of the second preset number of pieces of text content that follow it, and delete every following piece whose cosine similarity is greater than a preset similarity threshold.

For example, if 300 pieces of text content are included in the candidate auxiliary text set corresponding to one auxiliary word, the server may perform deduplication processing on the 300 pieces of text content.

Assuming that the second preset number is 200, for the 1st piece of text content, the cosine similarity between it and each of the 200 pieces of text content that follow it is calculated. If the cosine similarity between the 1st piece and the 2nd piece is smaller than the preset similarity threshold (for example, 80%), the 2nd piece is retained; if the cosine similarity between the 1st piece and the 3rd piece is greater than the preset similarity threshold, the 3rd piece is deleted; and so on, until it has been judged whether the cosine similarity between the 1st piece and the 200th piece is greater than 80%.

Then, based on the 2nd piece of text content in the adjusted auxiliary text set, the cosine similarity between it and the 3rd piece is calculated. The 3rd piece is deleted if the cosine similarity is greater than 80% and retained if it is less than 80%, and so on, until it has been judged whether the cosine similarity between the 2nd piece and the 201st piece is greater than 80%.

The same processing is then carried out for each subsequent piece of text content in the adjusted auxiliary text set, until the cosine similarity between the second-to-last piece and the last piece of text content in the auxiliary text set has been judged and the last piece has been deleted or retained accordingly. The text content remaining in the auxiliary text set can then be used as the auxiliary text of the auxiliary word.
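The windowed de-duplication can be sketched as follows. This is a simplified illustration under stated assumptions: each piece of text content is turned into a bag-of-words vector by whitespace splitting (real text would use the stored word segments), and the names `cosine_similarity` and `deduplicate` are hypothetical. The window of 200 and the threshold of 80% are the example values from the text.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(texts, window=200, threshold=0.8):
    """Delete texts that are too similar to an earlier retained text."""
    vectors = [Counter(t.split()) for t in texts]  # assumed tokenisation
    kept = list(range(len(texts)))
    i = 0
    while i < len(kept):
        base = vectors[kept[i]]
        # Compare only with the next `window` texts of the adjusted set.
        following = kept[i + 1 : i + 1 + window]
        duplicates = {
            j for j in following if cosine_similarity(base, vectors[j]) > threshold
        }
        kept = [j for j in kept if j not in duplicates]
        i += 1
    return [texts[j] for j in kept]


candidates = ["a b c d", "a b c d", "x y z", "a b c e"]
auxiliary_texts = deduplicate(candidates)
```

Here the second text is an exact copy of the first and is deleted, while the fourth text has a similarity of 0.75 to the first, which is below the threshold, so it is retained.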

Optionally, because users generally do not read beyond a preset number of pieces of text content (for example, 2000), only that preset number of pieces of text content in the auxiliary text set may be de-duplicated, in order to reduce the computational complexity.

Optionally, in another embodiment, if the calculated cosine similarity between two pieces of text content is greater than the preset similarity threshold, the piece with the lower heat value of the two is deleted.

The heat information of the text content can comprise a plurality of dimensions, and the heat value of the text content can be obtained by carrying out weighted summation on the heat information of each dimension.

Taking comment messages on a video website as an example, the heat information includes the reading amount, the praise amount, and the reply amount of a comment message. The heat value of the comment message may be: 0.2 × reading amount + 0.4 × praise amount + 0.4 × reply amount, where 0.2 is the weight of the reading amount, 0.4 is the weight of the praise amount, and 0.4 is the weight of the reply amount. In practical applications, the dimensions included in the heat information and the weight of each dimension can be set according to actual conditions, which is not limited in this application.
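The weighted summation can be illustrated with a short sketch. The function names `heat_value` and `keep_hotter`, the tuple layout, and the sample numbers are assumptions for illustration; the weights are the example values from the text.

```python
def heat_value(reads, likes, replies, weights=(0.2, 0.4, 0.4)):
    """Weighted sum of the heat information dimensions of one comment."""
    w_read, w_like, w_reply = weights
    return w_read * reads + w_like * likes + w_reply * replies


def keep_hotter(a, b):
    """Of two near-duplicate comments, return the one with the higher heat value.

    Each comment is a (reads, likes, replies) tuple, an assumed layout.
    """
    return a if heat_value(*a) >= heat_value(*b) else b


example = heat_value(reads=100, likes=10, replies=5)  # 20 + 4 + 2
survivor = keep_hotter((100, 10, 5), (50, 40, 30))
```

The second function shows how the heat value could drive the alternative de-duplication embodiment, in which the less popular of two similar pieces of text content is the one deleted.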

Alternatively, since the heat of the text content may change with time, the embodiment of the present application may also multiply the attenuation coefficient based on the weighted summation when calculating the heat value.

The formula of the attenuation coefficient is:

T(t)＝T(t_0)×e^(-k(t_0-t))

where T(t) is the attenuation coefficient, t_0 is the generation time of the text content, t is the current time, and k is an adjustment coefficient. The value of the adjustment coefficient can be set according to actual conditions. Adjusting the value of k changes the attenuation coefficient and therefore the heat value of the text content; for example, k can be chosen according to actual requirements so that the heat value of a piece of text content falls to half of its previous value after one day.
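The effect of the adjustment coefficient can be sketched numerically. This sketch makes two assumptions that are not stated in the text: the exponent is written in terms of the elapsed time (current time minus generation time) with a positive k, so that the coefficient decreases as the content ages, and time is measured in days. The function names are hypothetical.

```python
import math


def decay_coefficient(generated_at, now, k, initial=1.0):
    """Exponential attenuation coefficient after (now - generated_at) time units."""
    elapsed = now - generated_at  # assumed sign convention: elapsed time >= 0
    return initial * math.exp(-k * elapsed)


def k_for_half_life(half_life):
    """Adjustment coefficient that halves the coefficient every `half_life` units."""
    return math.log(2) / half_life


k = k_for_half_life(1.0)               # halve once per day
after_one_day = decay_coefficient(0.0, 1.0, k)
after_two_days = decay_coefficient(0.0, 2.0, k)
```

Multiplying the weighted heat sum by this coefficient would make a one-day-old comment count half as much as a new one with the same reading, praise, and reply amounts.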

S1033, generating hot spot information corresponding to the main word, wherein the hot spot information comprises the main word, auxiliary words associated with the main word and auxiliary text of each auxiliary word.

After determining the auxiliary words corresponding to a main word, the server determines, for each combination of the main word and an auxiliary word, the text content that includes both the main word and the auxiliary word. This text content is the hot text content of the website. However, text content such as comment messages posted by users on a website is often repeated, so in the embodiment of the application the text content in the candidate auxiliary text set is further de-duplicated. Filtering out repeated text content reduces the repetition rate of the finally determined hot spot information, so that accurate and concise hot spot information can be determined and the workload of operators is reduced.

In another implementation manner provided in the embodiment of the present application, before obtaining the text content generated in the specified data source in the first historical period, a distributed storage method may be used to store the text content and the word segmentation of the text content, as shown in fig. 5, and specifically includes the following steps:

S501, when new text content is generated in the designated data source, a word segmentation operation is performed on each piece of new text content generated in the designated data source.

The server can acquire the text content generated in the appointed data source in real time, and when new text content is generated in the appointed data source, the server can receive the notification message, and further acquire the newly added or changed text content from the message queue.

Optionally, the server may also filter text content generated in the specified data source that is unlikely to be hot spot information. Taking a video website as an example, text content of a designated channel can be filtered, for example, if hot spot information in entertainment news needs to be generated, text content generated in a game channel is irrelevant to the hot spot information, so that text content generated in the game channel can be filtered. Of course, the filtering method for the text in the practical application is not limited to this, and the filtering rule may be set based on the practical requirement.

S502, storing the segmentation words included in the text content and the text content identification of the text content to a first preset database correspondingly.

Optionally, the first preset database may be a Remote Dictionary Server (Redis) database.

S503, storing the text content identification of the text content, the text content and the generation time of the text content in a second preset database correspondingly.

The server may also store the heat information of the text content in the second preset database. The heat information includes the number of likes, the number of comments, the number of reads, and similar information about the text content.

Optionally, the second preset database may be a Couchbase database.

Based on this storage manner, S1011, acquires text content generated in the specified data source in the first historical time period, generates a text document from the text content corresponding to each sub-time period in the first historical time period, and performs word segmentation processing on the text document corresponding to each sub-time period to obtain a first word segmentation result corresponding to the text document, which may be specifically implemented as follows:

Acquiring, from the second preset database, the text content whose generation time belongs to the first historical time period.

Generating, for each sub-time period in the first historical time period, one text document from the acquired text content whose generation time belongs to that sub-time period.

For the text document corresponding to each sub-time period, acquiring the word segments included in each piece of text content from the first preset database according to the text content identifier of each piece of text content included in the text document, to obtain the first word segmentation result corresponding to the text document.

Similarly, for each main word, the step S1021 of obtaining a text content set including the main word generated in the specified data source in the second historical time period before the sub-time period to which the main word belongs, and performing word segmentation processing on each piece of text content in the text content set to obtain a second word segmentation result may be specifically implemented as follows:

For each main word, acquiring, from the second preset database, the text content that contains the main word and whose generation time belongs to the second historical time period, forming the acquired text content into a text content set, and acquiring the word segments included in each piece of text content from the first preset database according to the identifier of each piece of text content included in the text content set, to obtain the second word segmentation result.

Therefore, with this storage manner, when new text content is generated in the specified data source, the text content is segmented; then, using the text content identifier as an index, the word segments included in the text content are stored in the first preset database, and the text content, its generation time, and its heat information are stored in the second preset database. When hot spot information needs to be generated, the server can directly acquire the text content generated in the first historical time period from the second preset database and acquire the word segments of that text content from the first preset database. That is, in the hot spot information generation stage, the hot spot information can be generated directly from the information stored in the first and second preset databases, without operations such as word segmentation, so that the time the server needs to generate the hot spot information is shortened and the generation efficiency is improved.
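The division of labour between the two databases can be sketched with in-memory stand-ins. This is only an illustration of the data layout: plain dictionaries replace the Redis and Couchbase databases, no real client library is used, and the class name `HotspotStore`, its methods, and the sample records are hypothetical.

```python
from collections import defaultdict


class HotspotStore:
    """In-memory stand-in for the first and second preset databases."""

    def __init__(self):
        # First preset database: word segments keyed by text content identifier,
        # plus an assumed reverse index from segment to identifiers.
        self.segments_by_id = {}
        self.ids_by_segment = defaultdict(set)
        # Second preset database: text content and generation time by identifier.
        self.content_by_id = {}

    def add(self, content_id, text, generated_at, segments):
        """Store a newly generated, already segmented piece of text content."""
        self.segments_by_id[content_id] = list(segments)
        for segment in segments:
            self.ids_by_segment[segment].add(content_id)
        self.content_by_id[content_id] = {"text": text, "generated_at": generated_at}

    def ids_between(self, start, end):
        """Identifiers of text content generated in the half-open period [start, end)."""
        return sorted(
            cid
            for cid, doc in self.content_by_id.items()
            if start <= doc["generated_at"] < end
        )

    def segments_for(self, content_ids):
        """Word segments of the given text content, read back without re-segmenting."""
        return [seg for cid in content_ids for seg in self.segments_by_id[cid]]


store = HotspotStore()
store.add("c1", "new film trailer", 10, ["film", "trailer"])
store.add("c2", "film review", 25, ["film", "review"])
period_ids = store.ids_between(0, 20)
first_result = store.segments_for(period_ids)
```

Because segmentation happens once at write time, building a word segmentation result for a time period is reduced to two lookups keyed by the text content identifier.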

Optionally, after the server generates the hotspot information corresponding to the main word, for each auxiliary word associated with the main word, a heat value of each auxiliary text of the auxiliary word is obtained from a second preset database, and then each auxiliary text of the auxiliary word is displayed according to the order of the heat values from large to small.

Alternatively, in another embodiment, each of the auxiliary texts may be displayed in order of the generation time from early to late or from late to early.

Optionally, in another implementation manner provided in the embodiment of the present application, after the main word and at least one auxiliary word associated with the main word are selected, a main word list and the auxiliary word list corresponding to each main word may further be displayed, so that operation and maintenance personnel can select a target hotspot word based on the main word list and the auxiliary word list corresponding to each main word.

The target hotspot word comprises a main word and at least one auxiliary word associated with the main word. And generating hot spot information corresponding to the main word by the method of the S103 based on the main word included in the target hot spot word and at least one auxiliary word associated with the main word.

The target hotspot word may also include a combined hotspot word formed from a plurality of auxiliary word combinations. If the target hot words comprise the combined hot words, the server respectively acquires text content identification lists corresponding to each auxiliary word included in the combined hot words from a first preset database, and then acquires intersections of the text content identification lists corresponding to each auxiliary word to serve as auxiliary text content identification lists of the combined hot words.

For example, if one of the hot words is a word formed by combining the word B and the word C, the text content identification list corresponding to the word B and the text content identification list corresponding to the word C may be obtained from the first preset database, and an intersection is taken for the text content identification list corresponding to the word B and the text content identification list corresponding to the word C, and the intersection is used as the auxiliary text content identification list of the hot word. And then acquiring text content corresponding to each text content identifier in the auxiliary text content identifier list from a second preset database, so as to obtain auxiliary text corresponding to the combined hot words.
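The intersection step for a combined hot word can be sketched as follows. The function name `combined_hot_word_ids`, the dictionary standing in for the first preset database, and the sample identifiers are assumptions for illustration.

```python
def combined_hot_word_ids(ids_by_word, words):
    """Identifiers of text content that contains every word of the combined hot word."""
    id_sets = [set(ids_by_word.get(word, ())) for word in words]
    if not id_sets:
        return []
    return sorted(set.intersection(*id_sets))


# Hypothetical identification lists as they might be read from the first database.
index = {"B": ["t1", "t2", "t3"], "C": ["t2", "t3", "t4"]}
# Hypothetical text content as it might be read from the second database.
contents = {"t1": "text 1", "t2": "text 2", "t3": "text 3", "t4": "text 4"}

auxiliary_ids = combined_hot_word_ids(index, ["B", "C"])
auxiliary_texts_for_combination = [contents[i] for i in auxiliary_ids]
```

Only the identifiers are intersected; the text content itself is fetched afterwards, and only for the identifiers that survive the intersection.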

The flow of the hot spot information acquisition method provided in the embodiment of the present application is described below with reference to a specific example in which the method is applied to a video website. As shown in fig. 6, the method specifically includes the six stages in fig. 6.

In the first stage, a document is generated.

The server obtains the feeds and comments generated on the video website within 15 days, and then merges them by sub-time period; that is, the feeds and comments generated within each sub-time period are combined into one text document.
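Grouping text content into one document per sub-time period can be sketched as below. The text does not fix the length of a sub-time period, so the one-day period used here, the timestamp unit (seconds), and the function name `build_documents` are assumptions for illustration.

```python
from collections import defaultdict


def build_documents(contents, period_seconds):
    """Merge (generated_at, text) pairs into one text document per sub-time period."""
    grouped = defaultdict(list)
    for generated_at, text in contents:
        # Integer division maps each timestamp to the index of its sub-time period.
        grouped[generated_at // period_seconds].append(text)
    return {period: "\n".join(texts) for period, texts in sorted(grouped.items())}


DAY = 24 * 60 * 60  # assumed sub-time period length
feeds_and_comments = [(0, "a"), (3600, "b"), (90000, "c")]
documents = build_documents(feeds_and_comments, DAY)
```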

Subsequent stages are then performed on a per text document basis.

In the second stage, main word candidates are generated.

Word segmentation is performed on the text document to obtain a first word segmentation result, and the first word segmentation result is then filtered according to the preset filtering rule to obtain a main word candidate word set.

In the third stage, TF-IDF is calculated.

The TF and the IDF of each word in the main word candidate word set are calculated, and the TF-IDF of each word is then calculated from its TF and IDF.

In the fourth stage, the main words are determined.

The words in the main word candidate word set are sorted by TF-IDF in descending order, and the first preset number of words with the largest TF-IDF are taken as main words.
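The third and fourth stages can be sketched together. The text does not give the exact TF and IDF formulas, so this sketch assumes a common variant: TF is the word's share of the document, and IDF is log(N / (1 + df)) over the text documents of all sub-time periods. The function name `top_main_words` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def top_main_words(documents, target_index, top_n):
    """Rank the words of one text document by TF-IDF and return the top_n words.

    documents: list of word lists, one per sub-time period (already filtered).
    """
    target = Counter(documents[target_index])
    total = sum(target.values())
    scores = {}
    for word, count in target.items():
        df = sum(1 for doc in documents if word in doc)
        idf = math.log(len(documents) / (1 + df))  # assumed smoothing
        scores[word] = (count / total) * idf
    # Descending TF-IDF; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


docs = [
    ["film", "film", "premiere", "today"],
    ["today", "weather"],
    ["today", "news"],
    ["today", "music"],
]
main_words = top_main_words(docs, target_index=0, top_n=2)
```

In this example "today" appears in every document, so its IDF is low and it is ranked below "film" and "premiere", which are specific to the first sub-time period.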

In the fifth stage, the auxiliary words are generated.

For each main word, the feeds and comments that include the main word and were generated in the 15 days before the sub-time period corresponding to the currently processed text document are acquired from the video website.

Word segmentation is performed on the acquired feeds and comments to obtain a second word segmentation result, the second word segmentation result is filtered according to the preset filtering rule to generate an auxiliary word candidate word set corresponding to the main word, and the word frequency of each word in the auxiliary word candidate word set is counted, where the word frequency is the number of times the word co-occurs with the main word in the acquired feeds and comments.

Words whose word frequency is greater than the preset count threshold are used as auxiliary words of the main word. For each auxiliary word, the text content that includes both the auxiliary word and its main word is acquired to generate a candidate auxiliary text set corresponding to the auxiliary word.

In the sixth stage, the auxiliary texts are de-duplicated.

The cosine similarity between pieces of text content in the candidate auxiliary text set is calculated, and the text content in the set is de-duplicated based on the cosine similarity.

The hotspot information may then be output based on the processing results of the six stages described above.

Based on the same technical concept, the embodiment of the present application further provides a hotspot information obtaining apparatus, as shown in fig. 7, where the apparatus is applied to a server, and the apparatus includes:

the main word generating module 701 is configured to obtain a first word segmentation result and word frequency information of each word segmentation based on word segmentation operation on text content in a specified data source, and select at least one main word from the first word segmentation result;

the auxiliary word generating module 702 is configured to perform word segmentation operation on text content including the main word in the specified data source for each main word, obtain a second word segmentation result and word frequency information of each word segmentation, and obtain at least one auxiliary word associated with the main word from the second word segmentation result;

The hotspot information generating module 703 is configured to obtain, from the specified data source, text content including the main word and at least one auxiliary word associated with the main word, and generate hotspot information corresponding to the main word based on the obtained text content.

Optionally, the main word generating module 701 is specifically configured to:

Optionally, the auxiliary word generating module 702 is specifically configured to:

Optionally, the hotspot information generating module 703 is specifically configured to:

Optionally, the main word generating module 701 is further configured to:

Optionally, the auxiliary word generating module 702 is further configured to:

Optionally, the first word segmentation result and the second word segmentation result both comprise part of speech of each word;

Optionally, the preset filtering rule further includes: if the word belongs to the white list word stock, the word is set to be in a state that the word cannot be filtered.

Optionally, as shown in fig. 8, the apparatus further includes: a word segmentation module 801 and a storage module 802.

A word segmentation module 801, configured to perform word segmentation operation on each new text content generated in the specified data source when the new text content is generated in the specified data source;

a storage module 802, configured to store the word segments included in the text content and the text content identifier of the text content correspondingly in a first preset database, and to store the text content identifier of the text content, the text content, and the generation time of the text content correspondingly in a second preset database;

Optionally, the main word generating module 701 is further configured to:

the auxiliary word generating module 702 is further configured to:

The embodiment of the present application further provides a server, as shown in fig. 9, including a processor 901, a communication interface 902, a memory 903, and a communication bus 904, where the processor 901, the communication interface 902, and the memory 903 perform communication with each other through the communication bus 904,

A memory 903 for storing a computer program;

the processor 901 is configured to implement the method flow executed by the server in the method embodiment when executing the program stored in the memory 903.

The communication bus mentioned by the server may be a peripheral component interconnect standard (Peripheral Component Interconnect, PCI) bus or an extended industry standard architecture (Extended Industry Standard Architecture, EISA) bus, etc. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the figures are shown with only one bold line, but not with only one bus or one type of bus.

The communication interface is used for communication between the server and other devices.

The Memory may include random access Memory (Random Access Memory, RAM) or may include Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.

The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In yet another embodiment of the present invention, there is also provided a computer readable storage medium having stored therein a computer program which, when executed by a processor, implements the steps of any of the hotspot information acquisition methods described above.

In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform any of the hotspot information acquisition methods of the embodiments described above is also provided.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method for obtaining hotspot information, wherein the method is executed in a server and comprises the following steps:

acquiring text content comprising the main word and at least one auxiliary word associated with the main word from the appointed data source, and generating hot spot information corresponding to the main word based on the acquired text content;

the word segmentation operation is performed on text content in a specified data source to obtain a first word segmentation result and word frequency information of each word segmentation, and at least one main word is selected from the first word segmentation result, including:

selecting at least one main word from the main word candidate word set based on word frequency information of each word in the main word candidate word set;

before the obtaining the text content generated in the specified data source for the first historical period, the method further includes:

and aiming at the text document corresponding to each sub-time period, acquiring the word segmentation included in each piece of text content from the first preset database according to the text content identification of each piece of text content included in the text document, and obtaining a first word segmentation result corresponding to the text document.

2. The method of claim 1, wherein the selecting at least one main word from the main word candidate word set based on word frequency information of each word in the main word candidate word set comprises:

3. The method according to claim 2, wherein the obtaining, for each main word, based on a word segmentation operation on the text content including the main word in the specified data source, a second word segmentation result and word frequency information of each word segmentation, and obtaining at least one auxiliary word associated with the main word from the second word segmentation result includes:

4. The method of claim 3, wherein the selecting at least one auxiliary word associated with the main word from the auxiliary word candidate word set based on word frequency information of the words in the auxiliary word candidate word set comprises:

5. The method of claim 3 or 4, wherein obtaining text content including the main word and at least one auxiliary word associated with the main word from the specified data source, generating hotspot information corresponding to the main word based on the obtained text content, comprises:

6. The method of claim 5, wherein the deduplicating text content in the candidate auxiliary text set comprises:

7. The method of claim 1, wherein the generating the main word candidate word set corresponding to the text document based on the first word segmentation result corresponding to the text document comprises:

8. The method of claim 3, wherein generating the set of auxiliary word candidates corresponding to the main word based on the second word segmentation result comprises:

9. The method of claim 7 or 8, wherein the first word segmentation result and the second word segmentation result each include a part of speech of each word segment;

10. The method of claim 9, wherein the preset filtering rules further comprise: if the word belongs to the white list word stock, the word is set to be in a state that the word cannot be filtered.

11. The method according to claim 3, wherein for each main word, obtaining a text content set including the main word generated in the specified data source in a second historical period before a sub-period to which the main word belongs, performing word segmentation processing on each text content in the text content set, and obtaining the second word segmentation result, includes:

12. A hotspot information acquisition apparatus, wherein the apparatus is applied to a server, the apparatus comprising:

the hot spot information generation module is used for acquiring text contents comprising the main word and at least one auxiliary word associated with the main word from the appointed data source and generating hot spot information corresponding to the main word based on the acquired text contents;

the main word generation module is specifically configured to:

the apparatus further comprises: the word segmentation module and the storage module;

The main word generation module is specifically configured to:

13. A server, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for carrying out the method steps of any one of claims 1-11 when executing a program stored on a memory.

14. A machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to carry out the method steps of any one of claims 1-11.