CN111125305A - Hot topic determination method and device, storage medium and electronic equipment - Google Patents

Hot topic determination method and device, storage medium and electronic equipment

Info

Publication number
CN111125305A
Authority
CN
China
Prior art keywords
topic
short text
text
subject
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911235739.XA
Other languages
Chinese (zh)
Inventor
孙学浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongsoft Group Dalian Co ltd
Neusoft Corp
Original Assignee
Dongsoft Group Dalian Co ltd
Neusoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongsoft Group Dalian Co ltd and Neusoft Corp
Priority to CN201911235739.XA
Publication of CN111125305A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure relates to a method and a device for determining hot topics, a storage medium and an electronic device. The method comprises the following steps: acquiring text information of a plurality of short texts and a plurality of pieces of comment information under at least one short text; for at least one short text, determining target comment information similar to the text information of the short text from the plurality of pieces of comment information under the short text, and expanding the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text; and determining the hot topics according to the target subject word set corresponding to each short text. This technical scheme effectively mitigates the sparsity of short texts, while the meaning of the target subject word set obtained after expansion remains consistent with the meaning the short text is intended to express. In addition, the target subject word set obtained after expansion provides an accurate basis for determining the hot topics.

Description

Hot topic determination method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of network technologies, and in particular, to a method and an apparatus for determining a trending topic, a storage medium, and an electronic device.
Background
With the continuous development of network technologies, platforms such as microblogs and forums have become places where users acquire information and communicate. Through the hot topics provided by a platform, users can learn about breaking events that are currently happening and about topics with high user participation.
In the related art, a trending topic is generally identified by extracting keywords or feature words from the text issued by a publisher and taking the related topics in which those keywords or feature words occur most frequently as trending topics. However, this direct extraction approach is mainly suited to long texts, whose abundant content makes feature words easy to extract. Short texts have little content and are therefore sparse, so determining trending topics directly by feature word extraction does not work well, and the accuracy of the discovered trending topics cannot be guaranteed.
Disclosure of Invention
The purpose of the present disclosure is to provide a method, an apparatus, a storage medium, and an electronic device for determining a trending topic, so as to solve the problem of sparsity of short texts and improve the accuracy of determining the trending topic.
In order to achieve the above object, in a first aspect, the present disclosure provides a trending topic determination method, the method including: acquiring text information of a plurality of short texts and a plurality of pieces of comment information under at least one short text; for at least one short text, determining target comment information similar to the text information of the short text from the plurality of pieces of comment information under the short text, and expanding the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text; and determining the trending topics according to the target subject word set corresponding to each short text.
Optionally, the determining, from the plurality of pieces of comment information under the short text, target comment information similar to the text information of the short text includes: selecting a subject word from the text information of the short text to obtain a first subject word set corresponding to the short text; performing subject word selection on each piece of comment information to obtain a second subject word set corresponding to the comment information; and determining target comment information similar to the text information of the short text from the comment information according to the similarity between the first subject word set and the second subject word set corresponding to each piece of comment information.
Optionally, the similarity between the first topic word set and the second topic word set is determined by: respectively determining first semantic distance information between each subject term in the second subject term set and the first subject term set; and determining second semantic distance information between the first subject word set and the second subject word set according to the plurality of first semantic distance information, wherein the second semantic distance information is used as the similarity.
Optionally, the first semantic distance information is determined by the following formula:
$$d(\omega_j, C) = \frac{1}{\max_{c_i \in C} P(c_i \mid \omega_j)}$$
wherein $\omega_j$ represents the jth subject word in the second subject word set, $c_i$ represents the ith subject word in the first subject word set, $C$ represents the first subject word set, $d(\omega_j, C)$ represents the first semantic distance information between the jth subject word in the second subject word set and the first subject word set, and $P(c_i \mid \omega_j)$ represents the conditional probability between the ith subject word in the first subject word set and the jth subject word in the second subject word set.
Optionally, before the selecting the subject term for each piece of the comment information, the method further includes: deleting the comment information with a text length smaller than a preset text length threshold from the plurality of pieces of comment information under the short text.
Optionally, the determining a trending topic according to the target topic word set corresponding to each piece of the short text includes: inputting the target subject word set corresponding to each short text into a dynamic topic model to obtain, for each preset time period, the distribution probability of each of a plurality of topics in each target subject word set within that period; for each topic, determining the heat value of the topic in each preset time period according to the distribution probability of the topic in each target subject word set in the preset time period; for each topic, determining a total heat value of the topic according to the heat value of the topic in each preset time period; and determining the trending topics according to the total heat value of each topic.
Optionally, the determining, according to a distribution probability of the topic in each target topic word set in the preset time period, a heat value of the topic in each preset time period includes: determining the heat value according to the distribution probability by the following formula:
Figure BDA0002304830450000031
The determining the total heat value of the topic according to the heat value of the topic in each preset time period includes: determining the total heat value according to the heat value by the following formula:
Figure BDA0002304830450000032
wherein
Figure BDA0002304830450000033
represents the heat value of the topic k in the t-th preset time period, $M_t$ represents the number of target subject word sets in the t-th preset time period, $\theta_{d,t}$ represents the distribution probability of the topic k in the d-th target subject word set in the t-th preset time period, $T$ represents the total number of preset time periods, and $h(k)$ represents the total heat value of the topic k.
In a second aspect, the present disclosure provides a trending topic determination apparatus, the apparatus comprising: an acquisition module, used for acquiring text information of a plurality of short texts and a plurality of pieces of comment information under at least one short text; an expansion module, used for determining, for at least one short text, target comment information similar to the text information of the short text from the plurality of pieces of comment information under the short text, and expanding the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text; and a determining module, used for determining the trending topics according to the target subject word set corresponding to each short text.
Optionally, the expansion module comprises: the first selection submodule is used for performing subject word selection on the text information of the short text to obtain a first subject word set corresponding to the short text; the second selection submodule is used for performing subject word selection on each piece of comment information to obtain a second subject word set corresponding to the comment information; and the determining submodule is used for determining target comment information similar to the text information of the short text from the comment information according to the similarity between the first subject word set and the second subject word set corresponding to each piece of comment information.
Optionally, the determining sub-module is configured to determine a similarity between the first topic word set and the second topic word set by: respectively determining first semantic distance information between each subject term in the second subject term set and the first subject term set; and determining second semantic distance information between the first subject word set and the second subject word set according to the plurality of first semantic distance information, wherein the second semantic distance information is used as the similarity.
Optionally, the expansion module further comprises: and the deleting submodule is used for deleting the comment information of which the text length is smaller than a preset text length threshold from the plurality of pieces of comment information under the short text before the second selecting submodule selects the subject word of each piece of comment information.
Optionally, the determining module includes: an input submodule, used for inputting the target subject word set corresponding to each short text into a dynamic topic model to obtain, for each preset time period, the distribution probability of each of a plurality of topics in each target subject word set within that period; a heat value determining submodule, used for determining, for each topic, the heat value of the topic in each preset time period according to the distribution probability of the topic in each target subject word set in the preset time period; a total heat value determining submodule, used for determining, for each topic, the total heat value of the topic according to the heat value of the topic in each preset time period; and a topic determining submodule, used for determining the trending topics according to the total heat value of each topic.
In a third aspect, the present disclosure provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method provided by the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising: a memory having a computer program stored thereon; a processor for executing the computer program in the memory to implement the steps of the method provided by the first aspect of the present disclosure.
With the above technical scheme, because commenters generally write around the theme of a short text, the published comment content is highly relevant to the content of the short text. Expanding the text information of the short text with target comment information similar to it therefore increases the number of feature words in the short text and effectively mitigates the sparsity of short texts. At the same time, the meaning of the target subject word set obtained after expansion remains consistent with the meaning the short text is intended to express. In addition, the target subject word set obtained after expansion provides an accurate basis for determining the trending topics, so that users can learn the actual current hotspots of public opinion through the trending topics.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure without limiting the disclosure. In the drawings:
Fig. 1 is a flow diagram illustrating a trending topic determination method in accordance with an exemplary embodiment.
Fig. 2 is a flow diagram illustrating a method of determining target comment information in accordance with an exemplary embodiment.
Fig. 3 is a flow diagram illustrating a method of determining similarity between a first topic word set and a second topic word set in accordance with an example embodiment.
Fig. 4 is a flowchart illustrating a method for determining trending topics from a target topic word set corresponding to each short text, according to an example embodiment.
Fig. 5 is a block diagram illustrating a trending topic determination apparatus in accordance with an exemplary embodiment.
Fig. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Fig. 7 is a block diagram illustrating an electronic device in accordance with another example embodiment.
Detailed Description
The following detailed description of specific embodiments of the present disclosure is provided in connection with the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present disclosure, are given by way of illustration and explanation only, not limitation.
Fig. 1 is a flowchart illustrating a hot topic determination method according to an exemplary embodiment, which may be applied to an electronic device with data processing capability, such as a server, a terminal, and the like. The server may be, for example, a cloud server, a topic management server, or the like. The terminal may be, for example, a notebook computer, a desktop computer, or the like. As shown in fig. 1, the method may include S101-S103.
In S101, text information of a plurality of short texts and a plurality of pieces of comment information under at least one short text are acquired.
A short text is a text of short length that is brief and concise. A microblog post, for example, is generally limited to 140 characters. Short texts also take forms such as forum posts, chat messages, WeChat Moments posts, news subjects, mobile short messages, and document abstracts. Because short texts have little content and a casual style of expression, they are easy to read and share, and many users typically comment on them, for example by commenting on a microblog post, a forum post, or a WeChat Moments post.
The text information of the short text may refer to information such as text content in the short text, and the multiple pieces of comment information in the short text are multiple pieces of comment information for the short text.
In S102, for at least one short text, target comment information similar to the text information of the short text is determined from the comment information of the short text, and the text information of the short text is expanded by using the target comment information, so as to obtain a target topic word set corresponding to the short text.
Since the short text generally only contains a few to dozens of words with practical significance, the content for characterizing the short text is less, and therefore, the short text needs to be expanded when feature extraction is performed. At present, when a short text is expanded, the short text is generally expanded based on a predetermined word bank, but the predetermined word bank generally depends on manual editing, the characteristics are not comprehensive enough, and a theme originally expressed by the short text may be changed after the short text is expanded.
Taking a microblog as an example, after a publisher posts a microblog, a user who sees it and is interested in its content can comment on it. When commenting, the commenter generally writes around the subject of the microblog, so the published comment content is highly relevant to the microblog content, and the microblog and the comment information under it have similar or identical meanings.
Therefore, in the disclosure, the short text is expanded by using the comment information under the short text, so that not only can the problem of sparseness of the short text in the aspect of theme extraction be solved, but also the problem of change of the meaning of the short text after expansion can be effectively avoided.
For at least one short text, target comment information similar to the text information of the short text is determined from the comment information of the short text. Here, similarity can be understood as similarity in meaning: for example, haze and air pollution are similar in meaning, whereas haze and a reference certificate are not. The text information of the short text is then expanded by using the target comment information, which increases the number of feature words in the text information; for a short text expanded in this way, an expanded target subject word set corresponding to the short text is obtained. Preferably, every short text is expanded by using its own target comment information.
In S103, a trending topic is determined according to the target topic word set corresponding to each short text.
With the above technical scheme, because commenters generally write around the theme of a short text, the published comment content is highly relevant to the content of the short text. Expanding the text information of the short text with target comment information similar to it therefore increases the number of feature words in the short text and effectively mitigates the sparsity of short texts. At the same time, the meaning of the target subject word set obtained after expansion remains consistent with the meaning the short text is intended to express. In addition, the target subject word set obtained after expansion provides an accurate basis for determining the trending topics, so that users can learn the actual current hotspots of public opinion through the trending topics.
As shown in fig. 2, in S102 described above, a specific embodiment of determining target comment information similar to text information of the short text from among the plurality of pieces of comment information under the short text may include S201 to S203.
In S201, subject word selection is performed on the text information of the short text to obtain a first subject word set corresponding to the short text.
For example, the topic word selection can be performed by performing a word segmentation operation on the text information of the short text and removing stop words. And selecting the subject words of the short text A to obtain a first subject word set corresponding to the short text A.
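The selection step described above (word segmentation followed by stop-word removal) might be sketched as follows. This is only an illustration: the function name `select_topic_words`, the tiny stop-word list, and the regular-expression tokenizer for English text are all assumptions, and Chinese text would need a dedicated word segmenter instead.

```python
import re

# Hypothetical stop-word list for illustration only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to", "today"}


def select_topic_words(text, stop_words=STOP_WORDS):
    """Segment text into lowercase word tokens and drop stop words.

    The result is returned as a list so that repeated words stay countable.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]


first_topic_words = select_topic_words("The haze in the city is heavy today")
```

The same helper could be reused for each piece of comment information to obtain the second subject word sets.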
It should be noted that, for a short text that is not expanded by the target comment information, the target topic word set corresponding to the short text is the first topic word set obtained by selecting the topic words. For the short text after being expanded through the target comment information, the target subject word set corresponding to the short text is the target subject word set obtained after being expanded.
In S202, subject word selection is performed on each piece of review information to obtain a second subject word set corresponding to the review information.
The manner of selecting the subject term may refer to S201, which is not described herein again.
In an embodiment, before subject word selection is performed on each piece of comment information, comment information whose text length is smaller than the preset text length threshold may be deleted from the plurality of pieces of comment information under the short text. The subject word selection in S202 is then performed on each piece of comment information that remains after the deletion.
The text length may represent the number of words or the number of characters in the comment information, and accordingly, the preset text length threshold may be set according to the number of words or the number of characters, for example, according to the number of words, and the preset text length threshold may be set to 5 words. The comment information with the text length smaller than the preset text length threshold value, namely the comment information with less than 5 words, has less comment content, and cannot express the thought to be explained by the reviewer, so that the comment information can be regarded as meaningless comment. In this embodiment, comment information having a text length smaller than a preset text length threshold is deleted from a plurality of pieces of comment information under the short text, with the aim of excluding meaningless comments.
For example, suppose there are 10 pieces of comment information for a short text A, namely comments 1 to 10. If the text lengths of comments 1 and 2 are both smaller than the preset text length threshold, comments 1 and 2 may be deleted, and subject word selection is performed on each of the remaining comments 3 to 10. For example, subject word selection is performed on the comment 3 to obtain a second subject word set corresponding to the comment 3; subject word selection is performed on the comment 4 to obtain a second subject word set corresponding to the comment 4, and so on. It should be noted that the number of comments in the above embodiments is merely illustrative and does not limit the present disclosure.
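A minimal sketch of this length filter is given below, assuming the text length is measured in whitespace-separated words and using the example threshold of 5 words mentioned above. The function name `filter_short_comments` and the sample comments are hypothetical.

```python
def filter_short_comments(comments, min_words=5):
    """Keep only comments whose word count is not below the threshold."""
    return [comment for comment in comments if len(comment.split()) >= min_words]


comments = [
    "Agreed",
    "So true",
    "The haze has been heavy all week here",
]
remaining_comments = filter_short_comments(comments)
```

Counting characters instead of words would only require replacing `len(comment.split())` with `len(comment)`.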
In S203, target comment information similar to the text information of the short text is determined from the comment information according to the similarity between the first topic word set and the second topic word set corresponding to each piece of comment information.
For example, the more similar the meanings of the first topic word set corresponding to the short text A and the second topic word set corresponding to the comment 3, the higher the similarity between the two; conversely, the less similar their meanings, the lower the similarity.
In this step, as shown in fig. 3, the similarity between the first subject word set and the second subject word set may be determined through S2031 and S2032 in fig. 3.
In S2031, first semantic distance information between each subject word in the second subject word set and the first subject word set is determined, respectively.
For example, the second topic word set corresponding to the comment 3 includes a topic word a and a topic word b, and the first semantic distance information between the topic word a and the first topic word set may represent the similarity between the topic word a and the short text A in semantic expression.
In one embodiment, the smaller the first semantic distance information is, the more similar the semantics of the subject word a are to what the short text A expresses. In this embodiment, the first semantic distance information may be determined by the following equation (1):
$$d(\omega_j, C) = \frac{1}{\max_{c_i \in C} P(c_i \mid \omega_j)} \tag{1}$$
Specifically, $\omega_j$ represents the jth subject word in the second subject word set, $c_i$ represents the ith subject word in the first subject word set, $C$ represents the first subject word set, $d(\omega_j, C)$ represents the first semantic distance information between the jth subject word in the second subject word set and the first subject word set, and $P(c_i \mid \omega_j)$ represents the conditional probability between the ith subject word in the first subject word set and the jth subject word in the second subject word set.
The conditional probability $P(c_i \mid \omega_j)$ can be determined by the following equation (2):
$$P(c_i \mid \omega_j) = \frac{P(c_i, \omega_j)}{P(\omega_j)} \tag{2}$$
$P(\omega_j)$ may represent the probability of the jth subject word in the second subject word set; for example, it may be the ratio of the number of times the jth subject word appears in the comment information corresponding to the second subject word set to the total number of words in the comment information.
$P(c_i, \omega_j)$ may represent the joint probability of the ith subject word in the first subject word set and the jth subject word in the second subject word set. For example, the joint probability may be the ratio of the number of co-occurrences of the ith subject word and the jth subject word in the comment information corresponding to the second subject word set to which the jth subject word belongs to the total number of words in the comment information, where the number of co-occurrences is the smaller of the number of occurrences of the ith subject word in the comment information and the number of occurrences of the jth subject word in the comment information. For example, if the ith subject word in the first subject word set occurs 10 times in the comment information corresponding to the second subject word set to which the jth subject word belongs, and the jth subject word occurs 8 times in the comment information, the number of co-occurrences may be 8.
In formula (1), first, a conditional probability between each subject word in the first subject word set and each subject word in the second subject word set is calculated, and an inverse of a maximum value among the plurality of conditional probabilities is determined as first semantic distance information. The larger the maximum value of the conditional probability is, the smaller the first semantic distance information is, and the larger the semantic relevance expressed by the jth subject word in the second subject word set and the first subject word set can be represented. When the maximum value of the conditional probability is 0, the first semantic distance information can be regarded as infinite, and the jth subject word in the second subject word set can be regarded as completely unrelated to the semantic meaning expressed by the first subject word set.
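The computation described for formulas (1) and (2) might be sketched as follows: the joint probability uses the smaller of the two occurrence counts as the co-occurrence count, and the distance is the reciprocal of the largest conditional probability, or infinity when that maximum is 0. The function name `first_semantic_distance` and the token-list representation of a comment are assumptions made for illustration.

```python
import math
from collections import Counter


def first_semantic_distance(word_j, first_topic_words, comment_tokens):
    """Reciprocal of the largest conditional probability P(c_i | word_j).

    comment_tokens is the token list of the comment that word_j comes from.
    """
    counts = Counter(comment_tokens)
    total = len(comment_tokens)
    count_j = counts[word_j]
    if total == 0 or count_j == 0:
        return math.inf
    p_j = count_j / total  # P(word_j)
    best = 0.0
    for c_i in first_topic_words:
        co_occurrences = min(counts[c_i], count_j)  # smaller of the two counts
        p_joint = co_occurrences / total            # P(c_i, word_j)
        best = max(best, p_joint / p_j)             # P(c_i | word_j)
    return math.inf if best == 0 else 1.0 / best


tokens = ["haze", "air", "pollution", "haze", "mask"]
distance_related = first_semantic_distance("pollution", ["haze", "city"], tokens)
distance_unrelated = first_semantic_distance("pollution", ["city"], tokens)
```

In this sketch a subject word of the short text that never appears in the comment contributes a conditional probability of 0, so a comment sharing no words with the short text yields an infinite distance.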
In another embodiment, the larger the first semantic distance information is, the more similar the semantics of the subject word a in the second subject word set corresponding to the comment 3 are to what the short text A expresses. In this embodiment, the first semantic distance information may be determined, for example, by $\max_{c_i \in C} P(c_i \mid \omega_j)$, whose calculation is described in detail above.
In S2032, second semantic distance information between the first topic word set and the second topic word set is determined according to the plurality of first semantic distance information.
In this step, the second semantic distance information may be determined according to the sum of the plurality of first semantic distance information. For example, the first semantic distance information between the subject word a in the comment 3 and the first subject word set and the first semantic distance information between the subject word b in the comment 3 and the first subject word set are summed, and the second semantic distance information between the first subject word set and the second subject word set corresponding to the comment 3 is determined according to the summation result.
After summing the plurality of first semantic distance information, the result of the summation may be normalized, and the result after the normalization is determined as the second semantic distance information. The second semantic distance information may be used as the similarity between the first subject term set and the second subject term set.
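The summation and normalization step might look like the sketch below. The text does not say which normalization is used, so min-max scaling across all comments of one short text is an assumption made here purely for illustration, as is mapping infinite sums to 1.0; the function name and the comment ids are hypothetical.

```python
import math
from typing import Dict, List


def second_semantic_distances(first_distances: Dict[str, List[float]]) -> Dict[str, float]:
    """Sum the first semantic distances per comment, then rescale to [0, 1].

    Assumption: min-max scaling across comments. Infinite sums (a word that is
    completely unrelated) are mapped to 1.0, the most distant value.
    """
    sums = {cid: sum(ds) for cid, ds in first_distances.items()}
    finite = [s for s in sums.values() if math.isfinite(s)]
    if not finite:
        return {cid: 1.0 for cid in sums}
    low, high = min(finite), max(finite)
    span = high - low
    scaled: Dict[str, float] = {}
    for cid, s in sums.items():
        if not math.isfinite(s):
            scaled[cid] = 1.0
        elif span == 0:
            scaled[cid] = 0.0  # all finite sums are equal
        else:
            scaled[cid] = (s - low) / span
    return scaled
```

With this choice, a smaller normalized value means the comment is closer to the short text, which matches the first embodiment.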
Accordingly, in S203 described above, either of the following two embodiments may be used to determine, from the comment information, the target comment information similar to the text information of the short text.
First, if the first semantic distance information is determined according to the first embodiment provided in S2031, comment information whose similarity is smaller than a preset first similarity threshold may be determined as the target comment information. The first similarity threshold may be calibrated in advance and may be set to 0.6, for example. For example, if the similarity between the first topic word set and the second topic word set corresponding to each of comments 3 to 9 is smaller than the preset first similarity threshold, comments 3 to 9 may be used as target comment information similar to the text information of the short text a.
Secondly, if the first semantic distance information is determined according to the second embodiment provided in S2031, comment information with a similarity greater than a preset second similarity threshold may be determined as the target comment information. The second similarity threshold may be pre-calibrated.
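The two selection rules can be illustrated with one small helper. This is a sketch under stated assumptions: the similarity scores are already computed and keyed by a comment id, and the function name and the `smaller_is_closer` flag are hypothetical.

```python
from typing import Dict, List


def select_target_comments(
    similarity: Dict[str, float],
    threshold: float,
    smaller_is_closer: bool = True,
) -> List[str]:
    """Return the ids of comments treated as similar to the short text.

    smaller_is_closer=True follows the first embodiment (keep scores below the
    threshold); False follows the second embodiment (keep scores above it).
    """
    if smaller_is_closer:
        return [cid for cid, score in similarity.items() if score < threshold]
    return [cid for cid, score in similarity.items() if score > threshold]
```

Keeping both rules behind one flag makes it explicit that the direction of the comparison depends on how the first semantic distance information was defined.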
After the target comment information is determined, the text information of the short text may be expanded with the target comment information in S102 described above as follows:
merging the second topic word set corresponding to the target comment information with the first topic word set to obtain the target topic word set corresponding to the short text.
For example, the second topic word sets corresponding to the comments 3 to 9 may be merged with the first topic word set corresponding to the short text a, and the merged content may be used as the target topic word set corresponding to the short text a.
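A minimal sketch of the merging step is given below. It assumes the word sets are ordinary Python sets, so duplicates collapse; whether the disclosure keeps duplicate words is not stated, and the function name is hypothetical.

```python
from typing import Iterable, Set


def build_target_topic_words(
    first_set: Set[str],
    target_comment_sets: Iterable[Set[str]],
) -> Set[str]:
    """Merge the short text's word set with the word sets of its target comments."""
    merged = set(first_set)  # copy so the caller's set is not modified
    for words in target_comment_sets:
        merged |= words
    return merged
```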
Because short texts are updated quickly and spread easily, trending topics change dynamically over time. In the present disclosure, the heat value of a topic is calculated period by period, so that the dynamic change of trending topics can be observed.
Fig. 4 is a flowchart illustrating a method for determining a trending topic according to a target topic word set corresponding to each short text according to an exemplary embodiment, and as shown in fig. 4, the method may include S401-S404.
In S401, the target topic word set corresponding to each short text is input into the dynamic topic model, and the distribution probability of each of a plurality of topics in each target topic word set within each preset time period is obtained.
A dynamic topic model (DTM) clusters subject words by meaning: subject words with similar meanings are grouped into one category, and each category corresponds to one topic. For example, the DTM may group subject words such as "writing", "admission", "student", and "test paper" into one category, and the topic corresponding to that category may be labeled, for example, "college entrance examination".
Each short text carries a time label that marks when the short text was published, and the DTM can calculate the distribution probability of the plurality of topics in each target subject word set within each preset time period according to the time labels and the preset time periods.
Illustratively, taking a preset time period of one day as an example, Monday is set as the first preset time period, and M_1 represents the number of target topic word sets on Monday, i.e., the number of short texts published on Monday. The distribution probability of topic k in each target subject word set in the first preset time period is then the distribution probability of topic k in each of the M_1 target subject word sets on Monday.
The preset time period may also be set in other time units, such as hours, which is not specifically limited in the present disclosure.
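Assigning short texts to preset time periods by their time labels can be sketched as follows. This only illustrates the bookkeeping that precedes the topic model, not the DTM itself; the function name, the `start` parameter, and the 1-based period index are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Set, Tuple


def group_by_period(
    texts: Iterable[Tuple[datetime, Set[str]]],
    start: datetime,
    period: timedelta,
) -> Dict[int, List[Set[str]]]:
    """Bucket (publish time, target topic word set) pairs into periods 1, 2, ..."""
    buckets: Dict[int, List[Set[str]]] = defaultdict(list)
    for published_at, words in texts:
        if published_at < start:
            continue  # published before the first period; ignored in this sketch
        index = (published_at - start) // period + 1
        buckets[index].append(words)
    return dict(buckets)
```

The length of the list stored for period t then plays the role of M_t, the number of target topic word sets in that period.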
In S402, for each topic, a heat value of the topic in each preset time period is determined according to a distribution probability of the topic in each target subject word set in the preset time period.
In the present disclosure, according to the distribution probability, the heat value may be determined by the following formula (3):
Figure BDA0002304830450000131
where
Figure BDA0002304830450000132
represents the heat value of topic k in the t-th preset time period, M_t represents the number of target subject word sets in the t-th preset time period, and θ_{d,t} represents the distribution probability of topic k in the d-th target subject word set in the t-th preset time period.
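Because the image of formula (3) is not reproduced in this text, the exact form of the heat value cannot be confirmed here. The sketch below assumes one plausible reading, namely that the heat value of topic k in period t is the average of θ_{d,t} over the M_t target subject word sets; an unnormalized sum would be an equally plausible reading. The function name is hypothetical.

```python
from typing import Sequence


def period_heat(theta_k: Sequence[float]) -> float:
    """Heat of topic k in one period, from its probabilities in the M_t word sets.

    Assumption: the probabilities are averaged over M_t. The formula image is
    not available in this text, so this is not a confirmed reading.
    """
    m_t = len(theta_k)
    if m_t == 0:
        return 0.0  # no short texts were published in this period
    return sum(theta_k) / m_t
```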
In S403, for each topic, a total heat value of the topic is determined according to the heat value of the topic in each preset time period.
Wherein the total heat value of the topic can be determined by the following formula (4):
Figure BDA0002304830450000133
where T represents the total number of preset time periods, and h(k) represents the total heat value of topic k. The value of T may be set as needed. For example, if the preset time period is one day and T is set to 7, the total heat value may represent the total heat value of topic k from Monday to Sunday.
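The aggregation over the T preset time periods can be sketched as below. The image of formula (4) is not reproduced in this text, so summing the per-period heat values is an assumption; the function name and the example topic labels are hypothetical.

```python
from typing import Dict, List


def total_heat_by_topic(heat: Dict[str, List[float]]) -> Dict[str, float]:
    """Combine per-period heat values into one total per topic.

    heat maps a topic label to its heat values for periods 1..T.
    Assumption: the total is the plain sum over the T periods.
    """
    return {topic: float(sum(values)) for topic, values in heat.items()}
```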
In S404, a trending topic is determined based on the total heat value of each topic.
After determining the total heat value of each topic, the total heat values can be ranked from high to low, and the top N topics are determined as trending topics, where N can be set as needed.
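The ranking step is a plain sort on the total heat values. A minimal sketch, with a hypothetical function name and topic labels used only for illustration:

```python
from typing import Dict, List


def top_n_topics(total_heat: Dict[str, float], n: int) -> List[str]:
    """Return the n topics with the highest total heat, hottest first."""
    ranked = sorted(total_heat.items(), key=lambda item: item[1], reverse=True)
    return [topic for topic, _ in ranked[:n]]
```

Python's sort is stable, so topics with equal total heat keep their original relative order.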
In this way, the heat value of a topic is calculated separately for each preset time period, so that the dynamic change of the topic's heat value across different time periods can be obtained.
Based on the same inventive concept, the present disclosure also provides a trending topic determination apparatus, and fig. 5 is a block diagram of a trending topic determination apparatus shown according to an exemplary embodiment, as shown in fig. 5, the apparatus 500 may include: an obtaining module 501, configured to obtain text information of multiple short texts and multiple comment information under at least one short text; an expansion module 502, configured to determine, for at least one short text, target comment information similar to text information of the short text from the multiple pieces of comment information in the short text, and expand the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text; a determining module 503, configured to determine a trending topic according to the target topic word set corresponding to each short text.
With this apparatus, because a commenter generally writes around the theme of the short text, the published comment content is highly relevant to the content of the short text. Expanding the text information of the short text with target comment information similar to it therefore increases the number of feature words in the short text and effectively addresses the sparsity problem of short texts. At the same time, the meaning of the target subject word set obtained after the expansion does not change compared with the meaning the short text is intended to express. In addition, the target topic word set obtained after the expansion provides an accurate basis for determining trending topics, so that users can learn the current actual public opinion hotspots from the trending topics.
Optionally, the expansion module 502 may include: the first selection submodule is used for performing subject word selection on the text information of the short text to obtain a first subject word set corresponding to the short text; the second selection submodule is used for performing subject word selection on each piece of comment information to obtain a second subject word set corresponding to the comment information; and the determining submodule is used for determining target comment information similar to the text information of the short text from the comment information according to the similarity between the first subject word set and the second subject word set corresponding to each piece of comment information.
Optionally, the determining sub-module may be configured to determine the similarity between the first topic word set and the second topic word set by: respectively determining first semantic distance information between each subject term in the second subject term set and the first subject term set; and determining second semantic distance information between the first subject word set and the second subject word set according to the plurality of first semantic distance information, wherein the second semantic distance information is used as the similarity.
Optionally, the expansion module 502 may further include: a deleting submodule, configured to delete, from the plurality of pieces of comment information under the short text, comment information whose text length is smaller than a preset text length threshold, before the second selection submodule performs subject word selection on each piece of comment information.
Optionally, the determining module 503 may include: an input submodule, configured to input the target subject word set corresponding to each short text into a dynamic topic model to obtain the distribution probability of a plurality of topics in each target subject word set within each preset time period; a heat value determining submodule, configured to determine, for each topic, the heat value of the topic in each preset time period according to the distribution probability of the topic in each target subject word set in the preset time period; a total heat value determining submodule, configured to determine, for each topic, the total heat value of the topic according to the heat value of the topic in each preset time period; and a topic determining submodule, configured to determine the trending topics according to the total heat value of each topic.
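The filtering performed by the deleting submodule can be sketched as follows. Measuring text length in characters after trimming whitespace is an assumption, since the text does not define how length is counted; the function name is hypothetical.

```python
from typing import List


def drop_short_comments(comments: List[str], min_length: int) -> List[str]:
    """Keep only comments whose trimmed length reaches the preset threshold.

    Assumption: length is counted in characters after stripping whitespace.
    """
    return [c for c in comments if len(c.strip()) >= min_length]
```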
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating an electronic device 600 according to an example embodiment. As shown in fig. 6, the electronic device 600 may include: a processor 601 and a memory 602. The electronic device 600 may also include one or more of a multimedia component 603, an input/output (I/O) interface 604, and a communications component 605.
The processor 601 is configured to control the overall operation of the electronic device 600 to complete all or part of the steps of the hot topic determination method. The memory 602 is used to store various types of data to support operation of the electronic device 600, such as instructions for any application or method operating on the electronic device 600 and application-related data, such as contact data, transmitted and received messages, pictures, audio, video, and so forth. The memory 602 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk. The multimedia component 603 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 602 or transmitted through the communication component 605. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 604 provides an interface between the processor 601 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 605 is used for wired or wireless communication between the electronic device 600 and other devices. The wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, 4G, NB-IoT, eMTC, 5G, or a combination of one or more of them, which is not limited herein. Accordingly, the communication component 605 may include a Wi-Fi module, a Bluetooth module, an NFC module, and the like.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above hot topic determination method.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions which, when executed by a processor, implement the steps of the trending topic determination method described above. For example, the computer readable storage medium may be the memory 602 described above including program instructions executable by the processor 601 of the electronic device 600 to perform the trending topic determination method described above.
Fig. 7 is a block diagram illustrating an electronic device 700 in accordance with another example embodiment. For example, the electronic device 700 may be provided as a server. Referring to fig. 7, an electronic device 700 includes a processor 722, which may be one or more in number, and a memory 732 for storing computer programs that are executable by the processor 722. The computer programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processor 722 may be configured to execute the computer program to perform the trending topic determination method described above.
Additionally, the electronic device 700 may also include a power component 726, which may be configured to perform power management of the electronic device 700, and a communication component 750, which may be configured to enable communication, e.g., wired or wireless communication, of the electronic device 700. The electronic device 700 may also include input/output (I/O) interfaces 758. The electronic device 700 may operate based on an operating system stored in the memory 732, such as Windows Server, Mac OS X™, Unix™, Linux™, and the like.
In another exemplary embodiment, there is also provided a computer readable storage medium including program instructions which, when executed by a processor, implement the steps of the trending topic determination method described above. For example, the computer readable storage medium may be the memory 732 described above including program instructions that are executable by the processor 722 of the electronic device 700 to perform the trending topic determination method described above.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned trending topic determination method when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments. Various simple modifications may be made to the technical solution of the present disclosure within its technical idea, and these simple modifications all fall within the protection scope of the present disclosure.
It should be noted that, in the foregoing embodiments, various features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various combinations that are possible in the present disclosure are not described again.
In addition, any combination of various embodiments of the present disclosure may be made, and the same should be considered as the disclosure of the present disclosure, as long as it does not depart from the spirit of the present disclosure.

Claims (10)

1. A method for determining a trending topic, the method comprising:
acquiring text information of a plurality of short texts and a plurality of comment information under at least one short text;
for at least one short text, determining target comment information similar to the text information of the short text from the plurality of pieces of comment information under the short text, and expanding the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text;
and determining the hot topics according to the target topic word sets corresponding to each short text.
2. The method of claim 1, wherein the determining, from the plurality of pieces of comment information under the short text, target comment information similar to the text information of the short text comprises:
selecting a subject word from the text information of the short text to obtain a first subject word set corresponding to the short text;
performing subject word selection on each piece of comment information to obtain a second subject word set corresponding to the comment information;
and determining target comment information similar to the text information of the short text from the comment information according to the similarity between the first subject word set and the second subject word set corresponding to each piece of comment information.
3. The method of claim 2, wherein the similarity between the first topic word set and the second topic word set is determined by:
respectively determining first semantic distance information between each subject term in the second subject term set and the first subject term set;
and determining second semantic distance information between the first subject word set and the second subject word set according to the plurality of first semantic distance information, wherein the second semantic distance information is used as the similarity.
4. The method of claim 3, wherein the first semantic distance information is determined by the following formula:
Figure FDA0002304830440000021
wherein ω_j represents the jth topic word in the second topic word set, c_i represents the ith subject word in the first subject word set, C represents the first subject word set,
Figure FDA0002304830440000022
represents the first semantic distance information between the jth subject word in the second subject word set and the first subject word set, and P(c_i | ω_j) represents the conditional probability between the ith subject word in the first subject word set and the jth subject word in the second subject word set.
5. The method according to claim 2, wherein before the subject word selection is performed on each piece of the comment information, the method further comprises:
deleting comment information whose text length is smaller than a preset text length threshold from the plurality of pieces of comment information under the short text.
6. The method as claimed in claim 1, wherein the determining a trending topic according to the target topic word set corresponding to each of the short texts comprises:
inputting the target subject word set corresponding to each short text into a dynamic topic model to obtain the distribution probability of a plurality of topics in each target subject word set within each preset time period;
for each topic, determining the heat value of the topic in each preset time period according to the distribution probability of the topic in each target subject term set in the preset time period;
for each topic, determining a total heat value of the topic according to the heat value of the topic in each preset time period;
and determining the hot topics according to the total heat value of each topic.
7. The method as claimed in claim 6, wherein said determining the heat value of the topic in each of the preset time periods according to the distribution probability of the topic in each of the target topic word sets in the preset time period comprises:
determining the heat value according to the distribution probability by the following formula:
Figure FDA0002304830440000031
the determining the total heat value of the topic according to the heat value of the topic in each preset time period comprises the following steps:
determining the total heat value according to the heat value by the following formula:
Figure FDA0002304830440000032
wherein
Figure FDA0002304830440000033
represents the heat value of topic k in the t-th preset time period, M_t represents the number of target subject word sets in the t-th preset time period, θ_{d,t} represents the distribution probability of topic k in the d-th target subject word set in the t-th preset time period, T represents the total number of preset time periods, and h(k) represents the total heat value of topic k.
8. A trending topic determination apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring text information of a plurality of short texts and a plurality of comment information under at least one short text;
the expansion module is used for determining, for at least one short text, target comment information similar to the text information of the short text from the plurality of pieces of comment information under the short text, and expanding the text information of the short text by using the target comment information to obtain a target subject word set corresponding to the short text;
and the determining module is used for determining the hot topics according to the target topic word sets corresponding to the short texts.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to carry out the steps of the method of any one of claims 1 to 7.
CN201911235739.XA 2019-12-05 2019-12-05 Hot topic determination method and device, storage medium and electronic equipment Pending CN111125305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911235739.XA CN111125305A (en) 2019-12-05 2019-12-05 Hot topic determination method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN111125305A true CN111125305A (en) 2020-05-08

Family

ID=70497874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911235739.XA Pending CN111125305A (en) 2019-12-05 2019-12-05 Hot topic determination method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111125305A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103390051A (en) * 2013-07-25 2013-11-13 南京邮电大学 Topic detection and tracking method based on microblog data
US20140019119A1 (en) * 2012-07-13 2014-01-16 International Business Machines Corporation Temporal topic segmentation and keyword selection for text visualization
CN103745000A (en) * 2014-01-24 2014-04-23 福州大学 Hot topic detection method of Chinese micro-blogs
KR101613397B1 (en) * 2015-05-29 2016-04-18 한국과학기술원 Method and apparatus for associating topic data with numerical time series
CN107992549A (en) * 2017-11-28 2018-05-04 南京信息工程大学 Dynamic short text stream Clustering Retrieval method
CN109242534A (en) * 2018-08-07 2019-01-18 桂林电子科技大学 A kind of user's score in predicting method based on user comment dynamic analysis
CN109271514A (en) * 2018-09-14 2019-01-25 华南师范大学 Generation method, classification method, device and the storage medium of short text disaggregated model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张萌: "微博热点话题发现方法的研究和实现", no. 06, pages 3 - 5 *
曹丽娜 等: "基于主题模型的BBS话题演化趋势分析", 《管理科学学报》, vol. 17, no. 11, 30 November 2014 (2014-11-30), pages 4 *
郑飞 等: "基于分类的中文微博热点话题发现方法研究", 《信息网络安全》, no. 09, 30 September 2014 (2014-09-30), pages 127 - 131 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783468A (en) * 2020-06-28 2020-10-16 百度在线网络技术(北京)有限公司 Text processing method, device, equipment and medium
CN111783468B (en) * 2020-06-28 2023-08-15 百度在线网络技术(北京)有限公司 Text processing method, device, equipment and medium
CN113656580A (en) * 2021-08-12 2021-11-16 北京锐安科技有限公司 Method, device, equipment and medium for identifying spam comments
CN115062586A (en) * 2022-08-08 2022-09-16 山东展望信息科技股份有限公司 Hot topic processing method based on big data and artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination