CN110245355B - Text topic detection method, device, server and storage medium - Google Patents

Text topic detection method, device, server and storage medium

Info

Publication number
CN110245355B
CN110245355B
Authority
CN
China
Prior art keywords
topic
central
word
text
words
Prior art date
Legal status
Active
Application number
CN201910549752.6A
Other languages
Chinese (zh)
Other versions
CN110245355A (en)
Inventor
张国校
李铮
Current Assignee
Shenzhen Tencent Domain Computer Network Co Ltd
Original Assignee
Shenzhen Tencent Domain Computer Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Domain Computer Network Co Ltd
Priority to CN201910549752.6A
Publication of CN110245355A
Application granted
Publication of CN110245355B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/216 Parsing using statistical methods
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a text topic detection method, a text topic detection device, a server and a storage medium. In the method, a central topic word of network text data is determined; the topic dispersion of the central topic word across all discussion group text data is calculated, where the discussion group text data is obtained by dividing the network text data according to topic discussion groups; and a text topic is determined from the target central topic words, i.e., the central topic words whose topic dispersion is greater than a first preset value. Irrelevant topic items are thereby filtered out, and the detection accuracy of text topics is improved.

Description

Text topic detection method, device, server and storage medium
Technical Field
The present application relates to the field of internet information management technologies, and in particular, to a text topic detection method, a device, a server, and a storage medium.
Background
Information sharing platforms such as forums, microblogs and post bars are highly open: users publish information on them according to their personal interests and habits, so text information posted by tens of millions of users is generated on the Internet every day. When a major event occurs in society or on the network, discussion of the corresponding topic suddenly increases across all information sharing platforms; such a topic is characterized by strong burstiness and high topic dispersion.
By monitoring the topics users discuss, relevant personnel can discover negative topics in time and take remedial measures. For example, when a payment bug suddenly appears in a game, discussion of the payment bug in the game's post bar increases markedly; by monitoring this sudden topic, an operator can discover the problem in time and notify maintenance personnel to fix the payment bug.
In the related art, a generative probability model is generally used to produce related word pairs for each piece of comment data, and a bursty topic is generated from word pairs whose increment in a target time period exceeds a preset value. However, when a few users publish a large amount of repeated text about irrelevant topic items on a particular information sharing platform, this related technology cannot filter the irrelevant topic items out, so the detection accuracy of text topics is low.
Therefore, how to filter out irrelevant topic items and improve the detection accuracy of text topics is a technical problem that needs to be solved by those skilled in the art at present.
Disclosure of Invention
In view of this, the present application provides a text topic detection method, a text topic detection device, a server and a storage medium, which can filter out irrelevant topic items and improve the detection accuracy of text topics.
In order to achieve the above object, a first aspect of the present application provides a text topic detection method, including:
determining a central topic word of the network text data;
calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups;
determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value.
With reference to the first aspect of the present application, in a first implementation manner of the first aspect of the present application, setting topic word candidates meeting a preset condition in the topic cluster as the central topic word includes:
calculating the burst level of each topic word candidate item according to the word frequency information and the fluctuation index of the topic word candidate item;
and setting the topic word candidate item with the highest burst level in the topic cluster as the central topic word.
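The patent does not give the exact burst-level formula. As a hedged illustration only, the sketch below assumes a z-score-like measure — the current word frequency of a candidate relative to its historical mean, scaled by its historical fluctuation (here taken to be the standard deviation); the function names and data are invented for illustration.

```python
# Illustrative sketch only: the burst-level formula is an assumption
# (z-score of current frequency against word-frequency history).
from statistics import mean, pstdev

def burst_level(current_freq, history, eps=1e-9):
    """Burst level of a topic word candidate from its word frequency
    information and fluctuation index (assumed: standard deviation)."""
    mu = mean(history)
    sigma = pstdev(history)  # historical fluctuation of the word frequency
    return (current_freq - mu) / (sigma + eps)

def pick_central_topic_word(candidates):
    """Select the candidate with the highest burst level in a topic cluster.

    candidates maps word -> (current frequency, list of historical frequencies).
    """
    return max(candidates, key=lambda w: burst_level(*candidates[w]))

# Example: "payment_bug" spikes relative to a flat history, "weather" does not.
cluster = {
    "payment_bug": (0.30, [0.01, 0.02, 0.01, 0.02]),
    "weather":     (0.05, [0.04, 0.06, 0.05, 0.05]),
}
print(pick_central_topic_word(cluster))  # payment_bug
```

Any monotone combination of current frequency and historical fluctuation would fit the claim wording; the z-score is just one common choice.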
With reference to the first implementation manner of the first aspect of the present application, in a second implementation manner of the first aspect of the present application, determining the text topic according to the target center topic word includes:
setting each topic cluster whose central topic word has a topic dispersion greater than the first preset value as a target topic cluster, and determining the text topic according to all the target topic clusters.
With reference to the second implementation manner of the first aspect of the present application, in a third implementation manner of the first aspect of the present application, determining the text topic according to all the target topic clusters includes:
calculating a first co-occurrence rate of a central topic word and each non-central topic word of each target topic cluster;
setting the central topic word and the non-central topic word with the first co-occurrence rate larger than a second preset value as target topic words;
and determining the text topic according to all the target topic words.
With reference to the third implementation manner of the first aspect of the present application, in a fourth implementation manner of the first aspect of the present application, calculating a first co-occurrence rate of a central topic word and each non-central topic word of each target topic cluster includes:
determining text sub-data corresponding to the target topic cluster in the network text data;
setting the quantity ratio of the co-occurrence sentences to all sentences in the text sub-data as the first co-occurrence rate; wherein the co-occurrence sentence is a sentence in which each of the non-center topic words and the center topic words are included in the text sub-data.
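The co-occurrence rate defined above — the ratio of co-occurrence sentences to all sentences in the text sub-data — can be sketched directly. This is a minimal illustration; sentence tokenization and the word inputs are invented for the example.

```python
def co_occurrence_rate(sentences, central, other):
    """First co-occurrence rate: the fraction of sentences in the text
    sub-data that contain both the central topic word and the other word."""
    if not sentences:
        return 0.0
    both = sum(1 for s in sentences if central in s and other in s)
    return both / len(sentences)

# Each sentence is a list of tokens from the text sub-data.
sentences = [
    ["server", "update", "tonight"],
    ["update", "compensation", "when"],
    ["update", "compensation", "please"],
    ["nice", "weather"],
]
print(co_occurrence_rate(sentences, "update", "compensation"))  # 0.5
```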
With reference to the third implementation manner of the first aspect of the present application, in a fifth implementation manner of the first aspect of the present application, before determining the text topic according to all the target topic words, the method further includes:
setting the non-central topic words with the first co-occurrence rate smaller than or equal to the second preset value and the burst level larger than the third preset value as new central topic words;
and calculating a second co-occurrence rate of the new central topic word and each non-new central topic word of each target topic cluster, and setting the non-new central topic word with the second co-occurrence rate larger than the second preset value as the target topic word.
With reference to the second implementation manner of the first aspect of the present application, the third implementation manner of the first aspect of the present application, the fourth implementation manner of the first aspect of the present application, and the fifth implementation manner of the first aspect of the present application, in a sixth implementation manner of the first aspect of the present application, setting topic word candidates meeting preset conditions in the topic cluster as the central topic word includes:
calculating the burst level of each topic word candidate item according to the word frequency information and the fluctuation index of the topic word candidate item;
and setting the topic word candidate item with the highest burst level in the topic cluster as the central topic word.
To achieve the above object, a second aspect of the present application provides a text topic detection device, including:
the central topic word determining module is used for determining the central topic word of the network text data;
the dispersion calculating module is used for calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups;
the topic determination module is used for determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value.
To achieve the above object, a third aspect of the present application provides a server, including:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is used for storing a program, and the program is used for at least:
determining a central topic word of the network text data;
calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups;
determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value.
To achieve the above object, a fourth aspect of the present application provides a storage medium having stored therein computer-executable instructions which, when loaded and executed by a processor, implement the steps of the text topic detection method as described in any one of the above.
It can be seen that, after determining the central topic words of the network text data, the present application determines their topic dispersion. The topic dispersion represents how a central topic word is distributed across the network text data corresponding to each topic discussion group. The central topic words are screened by topic dispersion, and the text topic is determined from the central topic words whose topic dispersion is higher than a first preset value; this reduces the text topic misjudgment caused by individual users repeatedly publishing large amounts of specific content in a particular text data publishing community. The present application can therefore filter out irrelevant topic items and improve the detection accuracy of text topics.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present application, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of a component architecture of a text topic detection system in accordance with an embodiment of the present application;
fig. 2 shows a flow chart of a text topic detection method according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of discussion group text data partitioning provided by an embodiment of the present application;
FIG. 4 illustrates a central topic word determination diagram provided by an embodiment of the present application;
FIG. 5 shows a flow diagram of another text topic detection method in an embodiment of the present application;
FIG. 6 shows a flow diagram of yet another text topic detection method in accordance with an embodiment of the present application;
FIG. 7 shows a clustering schematic diagram of a Louvain community partitioning algorithm according to an embodiment of the present application;
fig. 8 is a schematic diagram showing the composition structure of a text topic detection device according to an embodiment of the present application;
fig. 9 shows a schematic diagram of a composition structure of a server according to an embodiment of the present application.
Detailed Description
With the scheme of the present application, irrelevant topic items can be filtered out when detecting topics in network text data such as microblog topics and post bar topics, improving the detection accuracy of text topics and giving an accurate grasp of what users across the network are discussing.
In this embodiment of the present application, the network text data is information published by users or official accounts on information sharing platforms such as microblogs, post bars and forums; it includes content that a user or official account publishes itself, as well as its comments on and forwarding of other users' content.
In this embodiment, the central topic word of the network text data is a word mainly discussed or mentioned in the network text data. It may be a word that appears frequently in the network text data, or a word that summarizes frequently appearing words. For example, if "potato", "sweet potato", "taro" and "radish" appear with high frequency in the network text data, "root vegetables" may be used as the central topic word.
In this embodiment, the discussion group text data is obtained by dividing all the network text data by topic discussion group. A topic discussion group is a group in which one or more users jointly publish text data: for example, a main post and its replies in the same thread can be regarded as one topic discussion group, all users posting comments on one microblog topic can be regarded as one topic discussion group, and all comments under one short video can be regarded as one topic discussion group. One topic discussion group may also contain several smaller topic discussion groups; this embodiment does not limit the unit of measurement — a whole post bar may be treated as one topic discussion group, or a particular main post and its replies within that bar may be treated as one, which is not specifically limited here. For example, when a main post-reply structure is regarded as a topic discussion group, all the network text data can be divided into multiple groups according to the main-post identifiers. Suppose the network text data comprises a1, a2, a3, b1, b2, c1, c2 and c3, where a1, b1 and c1 are three different main posts, a2 and a3 are replies to a1, b2 is a reply to b1, and c2 and c3 are replies to c1. Dividing a1 through c3 by main post-reply structure yields three groups: the first containing a1, a2 and a3, the second containing b1 and b2, and the third containing c1, c2 and c3.
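The division by main post-reply structure can be sketched in a few lines. The sketch below replays the a1 through c3 example from the text; the tuple representation of a post and the function name are assumptions made for the illustration.

```python
from collections import defaultdict

def divide_by_main_post(posts):
    """Group posts into topic discussion groups by main-post identifier.

    Each post is (post_id, parent_id), where parent_id is None for a main post
    and otherwise names the main post it replies to.
    """
    groups = defaultdict(list)
    for post_id, parent_id in posts:
        groups[parent_id or post_id].append(post_id)
    return dict(groups)

# a1, b1, c1 are main posts; the rest are replies, as in the example.
posts = [("a1", None), ("a2", "a1"), ("a3", "a1"),
         ("b1", None), ("b2", "b1"),
         ("c1", None), ("c2", "c1"), ("c3", "c1")]
print(divide_by_main_post(posts))
# {'a1': ['a1', 'a2', 'a3'], 'b1': ['b1', 'b2'], 'c1': ['c1', 'c2', 'c3']}
```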
In this embodiment, topic dispersion is the degree to which a central topic word is dispersed across all discussion group text data; it describes the distribution of the central topic word in the network. Continuing the example in which the network text data comprises a1 through c3: to determine the topic dispersion of central topic word A, count its word frequency in the first main post-reply structure (a1, a2, a3), in the second (b1, b2) and in the third (c1, c2, c3). If the word frequencies of A across the three structures differ only within a certain range, the dispersion of A is high; otherwise it is low. Each topic discussion group can be pictured as a room: when the topic dispersion of a central topic word is high, the topic of that word is being discussed in most rooms; when it is low, only a small portion of the rooms are discussing it, and the word cannot represent the topic most people are discussing. In this embodiment, the text topic is what is mainly discussed in the network text data, and it may comprise one or more words.
In order to facilitate understanding of the text topic detection method of the present application, a system to which the text topic detection method of the present application is applied is described below. Referring to fig. 1, a schematic diagram of a composition architecture of a text topic detection system in accordance with an embodiment of the present application is shown.
As shown in fig. 1, the text topic detection system provided in the embodiment of the present application includes: the data acquisition device 10, the server 20 and the client 30 are in communication connection through the network 40.
The data acquisition device 10 may be a server of an information sharing platform such as a post bar or microblog, and the text topic detection system may contain multiple data acquisition devices 10 so as to detect topics on multiple information sharing platforms. The server 20 may be a device such as a tablet computer or personal computer used to analyze the network text data. The client 30 may be a terminal device such as a mobile phone, tablet computer or personal computer, used to receive and display the text topic detection results of the server 20. In other text topic detection system architectures, the server 20 and the client 30 may be the same device.
In the embodiment of the present application, the data acquisition device 10 may collect and aggregate network text data on the information sharing platform; the network text data may include main posts and replies published by users as well as information published by the platform's official account. Since topic generation takes a certain time, the server 20 may select the network text data within a period of time for text topic detection, obtaining the text topic for that period. After obtaining the text topic, the server 20 uploads it over the network to the client 30, so that relevant staff can learn what users on the information sharing platform were generally discussing during that period.
The text topic detection process of the server will be described in detail below.
Referring to fig. 2, which is a schematic flow chart of a text topic detection method according to an embodiment of the present application, the method of the present embodiment may include:
s101, a server determines a central topic word of web text data;
the network text data may include information issued by users and/or authorities on any number of information sharing platforms, and since the generation of topics requires a certain time and topics generally have timeliness, the network text data referred to in this embodiment may be information generated in a specific time period, and a time difference between the issue time and the current time of the network text data is smaller than a preset time difference.
As one possible implementation, topic detection may be performed in units of one day, detecting the network text data generated each day. As another possible implementation, all network text data within one hour before the current moment may be used as the topic detection object, so that the current topic content of each information sharing platform can be learned in time.
Sources of the network text data in this embodiment may include, but are not limited to, information sharing platforms such as microblogs, post bars, news websites and video websites, as long as user comment data about an event exists there. It will be appreciated that the purpose of determining a text topic is to help relevant personnel understand people's feedback about a particular object of observation — for example, a piece of application software, an online game, a design drawing, typhoon weather, or Spring Festival travel tickets. To improve the quality of the feedback obtained, the source of the network text data can be restricted to information sharing platforms related to that object. For example, to learn user feedback on online game A, network text data may be obtained from game A's post bar or forum, thereby determining the trending topics currently being discussed. When relevant personnel find that the trending topic of online game A is a price bug in the game mall, the bug can be reported in time so that it can be fixed.
It should be noted that the central topic word mentioned in this embodiment is the core content of the topic discussed in the network text data, and there are various ways to determine it. For example: the words of the network text data can be ranked by frequency of occurrence from high to low and the top M words used as central topic words; or words whose frequency exceeds a preset frequency can be set as central topic words. As another possible implementation, semantic analysis can be performed on the high-frequency words to obtain central topic words that summarize each clustering result; for example, when "apple", "orange" and "watermelon" all appear frequently in the network text data, they can be grouped under "fruit", so the central topic word of the network text data is "fruit". Furthermore, the historical fluctuation of each word can be consulted, and words whose rate of change in frequency during the target time period exceeds a preset rate of change can be used as central topic words. Of course, other selection methods may exist in this embodiment, as long as words representing the core content of the discussed topic can be selected; this embodiment is not specifically limited.
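The simplest of the selection strategies above — ranking words by frequency and taking the top M — can be sketched as follows. The whitespace tokenization, stopword set and example texts are assumptions made for the illustration; a real system would use proper word segmentation.

```python
from collections import Counter

def central_topic_words(texts, top_m=3, stopwords=frozenset()):
    """Rank words of the network text data by frequency, high to low,
    and return the top-M words as central topic word candidates."""
    counts = Counter(w for t in texts for w in t.split() if w not in stopwords)
    return [w for w, _ in counts.most_common(top_m)]

texts = [
    "payment bug in the mall",
    "the payment bug again",
    "mall payment broken",
]
print(central_topic_words(texts, top_m=2, stopwords={"the", "in"}))
# ['payment', 'bug']
```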
S102, the server calculates topic dispersion of the central topic word in all discussion group text data;
Since the text topics of network text data are usually presented in the form of comments, the network text data corresponding to one topic discussion group can be regarded as a collection of text data discussing a particular topic. This embodiment divides all the network text data into several sets of discussion group text data by topic discussion group, so as to determine the topic dispersion of a central topic word across all discussion group text data.
The discussion group text data in this embodiment is obtained by dividing the network text data by topic discussion group. Referring to fig. 3, which shows a schematic diagram of discussion group text data division provided in an embodiment of the present application, the division of network text data by main post-reply structure in fig. 3 is illustrated below. For example, in forum A there are 10 pieces of network text data published by users, each numbered. Numbers 1, 5 and 8 are main posts; the rest are replies. According to the main-post and reply identifiers, the replies to main post 1 are numbers 2, 3 and 4; the replies to main post 5 are numbers 6 and 7; and the replies to main post 8 are numbers 9 and 10. Taking numbers 1 through 4 as one set of discussion group text data, numbers 5 through 7 as another, and numbers 8 through 10 as a third, dividing all the network text data by topic discussion group yields three sets of discussion group text data, where each set belongs to one topic discussion group, i.e., one main post-reply structure.
It is to be understood that the topic dispersion in this embodiment is a value describing the distribution of a central topic word across all discussion group text data. When the topic dispersion of central topic word X is high, X is frequently mentioned, at a relatively high frequency, in many topic discussion groups rather than in just one. To illustrate the process of determining topic dispersion: if a forum contains several topic discussion groups, word frequency information for the central topic word can be determined from its number of occurrences in each group's text data and each group's total word count. When the deviation of the central topic word's word frequency across all topic discussion groups is smaller than a preset deviation amount, its topic dispersion is high; when the deviation is greater than or equal to the preset deviation amount, its topic dispersion is low.
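The deviation test just described can be sketched directly: compute the word frequency of the central topic word in each discussion group, then take the maximum deviation from the mean; a deviation below the preset amount indicates high dispersion. Representing groups as token lists, and using the maximum absolute deviation, are assumptions made for the illustration.

```python
def frequency_deviation(word, groups):
    """Max deviation of `word`'s per-group word frequency from its mean.

    A small deviation (below a preset deviation amount) indicates that
    the word's topic dispersion is high; a large one that it is low.
    """
    freqs = [g.count(word) / len(g) for g in groups]  # frequency per group
    mu = sum(freqs) / len(freqs)
    return max(abs(f - mu) for f in freqs)

# "update" appears at the same rate in every discussion group.
groups = [
    ["update", "bug", "update", "fix"],
    ["update", "crash", "log", "update"],
    ["update", "patch", "update", "soon"],
]
preset_deviation = 0.1
print(frequency_deviation("update", groups) < preset_deviation)  # True -> high dispersion
```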
S103, the server determines a text topic according to the target center topic word;
the web text data corresponding to the same topic discussion group can be a discussion set of specific topics, so that the popularity of a central topic word in discussion can be determined by calculating topic dispersion of the central topic word in all the discussion group text data. If the dispersion of a central topic word is low, it means that most topic discussion groups do not use the central topic word as the core content of the discussion.
In this step, the target central topic words are the central topic words whose topic dispersion is greater than the first preset value. Determining the text topic using only these words is equivalent to filtering the central topic words by topic dispersion and removing those whose dispersion is smaller than the first preset value. It should be noted that central topic words with topic dispersion below the first preset value can be regarded as content not discussed by most topic discussion groups, i.e., irrelevant topic items.
On an information sharing platform, a few users may repeatedly publish the same or similar text data within one topic discussion group — for example, forum flooding, topic hyping or advertising — which can cause certain words to be selected as central topic words even though they are not the main content discussed across all topic discussion groups, leading to text topic misjudgment. Referring to fig. 4, a schematic diagram of central topic word determination provided by an embodiment of the present application: suppose a health forum has 5 topic discussion groups A, B, C, D and E, and three central topic words are determined from the word frequencies in the network text data of these 5 groups: "winter", "keep warm" and "a certain antihypertensive drug". The word frequencies of "winter" in groups A, B, C, D and E are 21%, 22%, 19%, 18% and 5%, respectively; those of "keep warm" are 20%, 23%, 21%, 24% and 6%; and those of "a certain antihypertensive drug" are 1%, 0%, 2%, 4% and 56%. The topic dispersion of "winter" and "keep warm" is therefore clearly better than that of the antihypertensive drug: the text topics of the health forum relate to "winter" and "keep warm", while the antihypertensive drug is not the main discussion content, and the network text corresponding to topic discussion group E may well be information published to promote a product.
In this embodiment, the central topic words are screened by topic dispersion, and the text topic is determined according to the central topic words whose topic dispersion is greater than the first preset value. As a possible implementation, the text topic may be determined by integrating all such central topic words and their categories. Of course, the number of central topic words whose topic dispersion exceeds the first preset value is limited, and text topics derived from those words alone may not fully express the content discussed in the network text data; therefore, on the basis of determining these central topic words, the text topic can be determined comprehensively by also combining words related to them in the network text data. For example, suppose the central topic word whose topic dispersion is greater than the first preset value in all discussion group text data is "stop update", and 60% of the sentences including "stop update" in all the network text data also include another word, "compensation"; the text topic of the current network text data can then be obtained as: stop update and compensation. Of course, other ways of determining text topics in combination with the central topic words may also be used, which is not specifically limited herein; filtering of irrelevant topic items is achieved by requiring the topic dispersion of the central topic words to be greater than the first preset value.
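The "stop update" example above can be sketched in code. This is a minimal illustration, not the patent's exact procedure; the function name `compose_topic`, the 0.5 share threshold and the sample sentences are assumptions:

```python
from collections import Counter

def compose_topic(sentences, central, threshold=0.5):
    # Sentences that contain the central topic word
    hits = [set(s.split()) for s in sentences if central in s.split()]
    if not hits:
        return [central]
    # Keep words that appear in a large enough share of those sentences
    counts = Counter(w for sent in hits for w in sent if w != central)
    related = sorted(w for w, n in counts.items() if n / len(hits) >= threshold)
    return [central] + related

# "compensation" co-occurs with "stop_update" in 3 of its 4 sentences:
sentences = [
    "stop_update compensation soon",
    "stop_update schedule unknown",
    "stop_update compensation amount",
    "stop_update compensation details",
    "new skin released",
]
print(compose_topic(sentences, "stop_update"))  # ['stop_update', 'compensation']
```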
S104, the client displays the text topics detected by the server.
After the central topic words of the web text data are determined, the topic dispersion of each central topic word is determined. The topic dispersion represents the distribution of a central topic word across the network text data of each topic discussion group: a high topic dispersion means the central topic word is discussed frequently in most topic discussion groups corresponding to the network text data, while a low topic dispersion means it is discussed infrequently across all topic discussion groups or only within individual ones. Since the central topic words are the central words of the topics discussed in the network text data, screening them by topic dispersion and determining the text topic from those whose dispersion exceeds the first preset value reduces the text topic misjudgments caused by individual users repeatedly publishing specific content in a specific community. The method and device can therefore filter out irrelevant topic items and improve the detection accuracy of text topics.
Referring to fig. 5, which is a schematic flow chart of another text topic detection method according to an embodiment of the present application, the method of the present embodiment may include:
S201, determining topic word candidates in the network text data, and performing clustering operation based on a main posting structure on the topic word candidates to obtain topic clusters;
in this embodiment, word division is performed on the network text data, topic word candidates are initially selected according to the occurrence frequency and lexical information of each word, and clustering operation based on topic discussion groups is performed on all topic word candidates to obtain a plurality of topic clusters. All topic word candidates within the same topic cluster correspond to the same topic discussion group, and the network text data corresponding to the topic cluster referred to herein corresponds to the discussion group text data referred to in the corresponding embodiment of fig. 2.
In this embodiment, the topic word candidates may include single keywords, may include ordered frequent items, or may include both. A keyword is a single word, and an ordered frequent item is an ordered combination of several words. Specifically, an ordered frequent item may be composed of a plurality of words in a sentence whose relative order is kept constant; the number of words included in an ordered frequent item is not limited here. For example, across variant expressions such as "the system has a bug", "the system reported a bug" and "a bug in the system was found", the words "system" and "bug" constitute an ordered frequent item.
As a possible implementation manner, the following operations may be included in the present embodiment S201: setting keywords in the web text data as topic word candidates; wherein the occurrence frequency of the keywords is larger than a fourth preset value; and/or setting ordered frequent items in the web text data as topic word candidates; the ordered frequent items are a plurality of words with fixed sequence in the target sentence pattern, and the occurrence frequency of the target sentence pattern in the network text data is larger than a fifth preset value.
In the above possible embodiment, keywords whose occurrence frequency is greater than the fourth preset value are set as topic word candidates, and ordered frequent items are likewise set as topic word candidates. In this embodiment, a sentence format containing an ordered frequent item is taken as the target sentence pattern, and the number of occurrences of the target sentence pattern containing the ordered frequent item must be greater than the fifth preset value. As a possible implementation, after the preliminary selection of ordered frequent items, importance scoring and filtering can also be performed based on the boundary degree of freedom.
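A minimal sketch of mining ordered frequent items, restricted to word pairs for brevity (the function name, the sample sentences and the pair-only simplification are assumptions; the patent allows items of any length):

```python
from collections import Counter
from itertools import combinations

def ordered_frequent_items(sentences, min_count=2):
    # Count ordered (not necessarily adjacent) word pairs within each
    # sentence; combinations() preserves left-to-right word order.
    counts = Counter()
    for sent in sentences:
        counts.update(combinations(sent.split(), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}

sentences = ["the system has a bug",
             "a bug appeared in the system",
             "the system reported a bug"]
items = ordered_frequent_items(sentences)
print(items[("system", "bug")])  # 2 ("system" precedes "bug" in two sentences)
```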
S202, setting topic word candidates meeting preset conditions in the topic cluster as central topic words.
In this embodiment, a central topic word is set in each topic cluster; since one topic cluster corresponds to one topic discussion group, this step is equivalent to selecting a central topic word for the web text data of each topic discussion group. In this way, topics discussed by a topic discussion group are not ignored merely because the group contains little text data. The present embodiment does not limit the number of central topic words per topic cluster; as a possible implementation, one central topic word may be selected for each topic cluster. After the central topic word is set, each topic cluster may contain two types of topic word candidates: those that are central topic words and those that are non-central topic words.
As a possible implementation, the process of setting topic word candidates in S202 may include the following steps:
step 1, calculating the burst level of each topic word candidate item according to word frequency information and fluctuation indexes of the topic word candidate item;
and 2, setting topic word candidates with highest burst level in the topic cluster as central topic words.
In the above feasible implementation, word frequency information and the volatility index serve as reference conditions for selecting central topic words: word frequency information refers to the occurrence frequency of a topic word candidate in the network text data, and the volatility index describes how that candidate's frequency has changed relative to its historical word frequency information. The burst level of a topic word candidate can be calculated by combining the two; a high burst level indicates that the candidate was rarely discussed historically but is suddenly discussed frequently in the time period corresponding to the network text data. It will be appreciated that when a significant event occurs in society, a word may be raised frequently within a short time, so the above possible embodiment can select topic word candidates with both high occurrence frequency and sudden character as central topic words. Determining text topics from such central topic words enables burst topic detection on the network text data: a text topic whose related discussion volume increases greatly in a period compared with the preceding period is called a burst topic.
Further, after the burst level of each topic word candidate is calculated, topic word candidates whose burst level is lower than a preset level may be rejected from the topic cluster. This filtering operation removes the candidates with lower burst levels, so that when the text topic is determined from the topic cluster in S204, the candidates with higher burst levels dominate, and the finally determined text topic represents the content that is suddenly being discussed in the network text data.
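One way to combine word frequency and volatility into a burst level is a volatility-scaled deviation from the historical mean. This is a hypothetical scoring formula; the patent names the two inputs but does not fix the formula:

```python
import statistics

def burst_level(current_freq, history):
    # Deviation of the current frequency from the historical mean,
    # scaled by historical volatility (population standard deviation).
    mean = statistics.mean(history)
    spread = statistics.pstdev(history) or 1e-9  # guard a perfectly flat history
    return (current_freq - mean) / spread

# A word that was quiet historically and suddenly spikes scores far
# higher than one whose frequency matches its history:
quiet_history = [2, 3, 2, 3, 2]
print(burst_level(40, quiet_history) > burst_level(3, quiet_history))  # True
```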
S203, calculating topic dispersion of each central topic word in all discussion group text data.
In this embodiment, each central topic word is the content discussed most intensively in its topic discussion group, and its burst level is the highest in its corresponding topic cluster. By calculating the topic dispersion of a central topic word across all discussion group text data, it can be determined whether a topic mainly discussed by one topic discussion group is also intensively discussed in the others. For example, suppose a game forum has 4 topic discussion groups A, B, C and D, and the operations of S201 and S202 determine the central topic word of group A to be "stop", that of group B "update", that of group C "patch" and that of group D "trade". The word frequencies of "stop" in groups A, B, C and D are 37%, 30%, 29% and 5%, respectively; those of "update" are 32%, 39%, 31% and 6%; those of "patch" are 12%, 19%, 51% and 3%; and those of "trade" are 1%, 0%, 2% and 76%. The topic dispersions of "stop" and "update" are therefore higher than that of "patch", which in turn is higher than that of "trade": stopping and updating are discussed intensively by most users within the game forum.
This shows that the text data discussed in topic discussion groups A and B is content generated around a certain burst event, so the burst topic of the web text data, i.e. the text topic, can be determined in conjunction with the content discussed in topic discussion groups A and B.
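One simple dispersion measure consistent with the example above is the normalized entropy of a word's frequency distribution across the discussion groups: it approaches 1 when the word is discussed evenly everywhere and 0 when it is concentrated in one group. This is a hypothetical measure; the patent does not commit to a particular formula:

```python
import math

def topic_dispersion(freqs):
    # Normalized entropy of a central topic word's frequency
    # distribution across topic discussion groups.
    total = sum(freqs)
    probs = [f / total for f in freqs if f > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(freqs))  # scale to [0, 1]

# Frequencies from the game-forum example (groups A, B, C, D):
stop = [0.37, 0.30, 0.29, 0.05]   # discussed in most groups
trade = [0.01, 0.00, 0.02, 0.76]  # concentrated in group D
print(topic_dispersion(stop) > topic_dispersion(trade))  # True
```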
S204, setting a topic cluster in which the central topic words with the topic dispersion degree larger than the first preset value are located as a target topic cluster, and determining text topics according to all the target topic clusters.
Although the central topic word has the highest burst level among all candidate topic words in its topic cluster, a single central topic word cannot accurately describe a text topic, because one topic is usually composed of a plurality of words. For example, suppose the central topic word of a topic discussion group is determined to be "bug", and its topic dispersion across all topic discussion groups is greater than the first preset value. If "bug" alone were taken as the text topic, it could not be determined whether the web text data discusses the existence of a bug, the repair of a bug, or compensation for a bug. Since the topic cluster may also contain other candidate topic words related to the central topic word, i.e. other words discussing the same topic as the central topic word, the present embodiment determines the text topic by combining multiple words in the target topic cluster. By checking whether the topic dispersion of each cluster's central topic word is greater than the first preset value, the topic clusters are screened into target topic clusters: the filtered-out clusters centre on content that is not frequently discussed by most topic discussion groups, while the candidate topic words in the target topic clusters are words frequently discussed by most groups, so determining the text topic from the target topic clusters improves the detection accuracy of text topics. Topic dispersion thus describes the distribution of the same topic across the whole network over the same period.
As a possible implementation, the process of determining the text topic from all the target topic clusters in S204 may include the following operations:
s2041, setting a topic cluster in which center topic words with topic dispersion larger than a first preset value are located as a target topic cluster;
s2042, calculating a first co-occurrence rate of the central topic word and each non-central topic word of each target topic cluster;
the topic word candidates in the target topic cluster other than the central topic word are referred to herein as non-central topic words, and the first co-occurrence rate may describe a probability that the central topic word and each non-central topic word co-occur in the same sentence. If the co-occurrence rate of a certain non-central topic word and a central topic word is higher, the association degree of the non-central topic word and the central topic word is higher.
For example, suppose the central topic word of a target topic cluster is "fruit" and the non-central topic words are "price" and "quality". If the network text data corresponding to the target topic cluster contains 50 sentences including "fruit", of which 40 also contain the non-central topic word "price" and 15 also contain "quality", then the first co-occurrence rate of the central topic word "fruit" with "price" is 80% and with "quality" 30%; it can be inferred that "price" is more strongly associated with "fruit", and that the discussion of the target topic cluster relates to fruit prices. In this example the first co-occurrence rate is calculated over the sentences that include the central topic word; it may also be calculated over all sentences corresponding to the target topic cluster, as follows: determine the text sub-data corresponding to the target topic cluster in the network text data, and set the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate, where a co-occurrence sentence is a sentence in the text sub-data containing both the central topic word and the given non-central topic word. Of course, the first co-occurrence rate may be calculated in other ways, which are not specifically limited here, as long as it describes the degree of association between the central topic word and each non-central topic word.
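The first variant of S2042 (computed over sentences containing the central topic word) can be sketched directly from the "fruit" example; the sentence texts below are illustrative stand-ins:

```python
def first_cooccurrence_rate(sentences, central, non_central):
    # Share of sentences containing the central topic word that also
    # contain the given non-central topic word.
    with_central = [s for s in sentences if central in s.split()]
    if not with_central:
        return 0.0
    both = sum(1 for s in with_central if non_central in s.split())
    return both / len(with_central)

# 50 sentences contain "fruit"; 40 also contain "price", 15 "quality":
sentences = (["fruit price rose"] * 35
             + ["fruit price quality"] * 5
             + ["fruit quality good"] * 10)
print(first_cooccurrence_rate(sentences, "fruit", "price"))    # 0.8
print(first_cooccurrence_rate(sentences, "fruit", "quality"))  # 0.3
```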
S2043, setting a central topic word and a non-central topic word with the first co-occurrence rate larger than a second preset value as target topic words;
This is equivalent to performing a filtering operation on the topic word candidates in the target topic cluster: non-central topic words with low correlation to the central topic word are filtered out, and the resulting target topic words comprise the central topic word together with the non-central topic words whose first co-occurrence rate is greater than the second preset value.
As a possible implementation, if the number of non-central topic words whose first co-occurrence rate is greater than the second preset value exceeds a preset number, the top Q non-central topic words ranked by first co-occurrence rate may be selected as target topic words.
This is equivalent to selecting target topic words by a central diffusion mechanism: taking the central topic word as the reference, the other topic word candidates are sorted and filtered according to their co-occurrence rate with the central topic word, and a given number of topic words are selected.
It should be noted that the above central diffusion mechanism ensures the accuracy of burst topic discovery but may reduce recall. This is mainly due to the structure of the topic discussion group targeted by this embodiment, such as the main-post structure: if the time span is relatively large, the topic under discussion may change. Taking the detection results for a certain game over a certain period as an example, the topic word candidates of a certain target topic cluster include: "reclaim", "miss", "active bug" and "compensate". From this target topic cluster it can be analysed that, along the time dimension, a bug appeared first, then users complained that it had been missed, and finally the discussion turned to topics of reclamation and compensation. If only the central diffusion mechanism were adopted, only the "reclaim" topic would be found. Therefore, this embodiment can further introduce a heterogeneous mechanism: assuming that a heterogeneous structure exists in the target topic cluster, a new round of screening is performed on the unselected topic word candidates on top of the central diffusion mechanism. Target words such as "active bug" and "miss" can then be identified through the heterogeneous mechanism.
The process of identifying the target topic words based on the heterogeneous mechanism can comprise the following steps:
step 1, setting non-central thematic words with the first co-occurrence rate smaller than or equal to a second preset value and the burst level larger than a third preset value as new central thematic words;
and 2, calculating a second co-occurrence rate of the new central topic word and each non-new central topic word of each target topic cluster, and setting the non-new central topic word with the second co-occurrence rate larger than a second preset value as the target topic word.
The above-described operation of selecting the target topic word again based on the heterogeneous mechanism corresponds to the re-execution of the related operations of S202 to S204 for each topic word candidate not selected as the target topic word. The second co-occurrence rate is calculated in substantially the same manner as the first co-occurrence rate, and will not be described in detail herein.
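The central diffusion step (S2043) followed by one heterogeneous round (steps 1 and 2 above) can be sketched as follows, using the "reclaim / miss / active bug" example. The data structures (`cooc` as nested co-occurrence rates, `burst` as burst levels) and all numeric values are assumptions for illustration:

```python
def select_target_words(candidates, central, cooc, burst, t2, t3):
    # Central diffusion: keep words whose co-occurrence rate with the
    # central topic word exceeds t2 (the second preset value).
    targets = {central} | {w for w in candidates
                           if cooc.get(central, {}).get(w, 0) > t2}
    # Heterogeneous round: a leftover word whose burst level exceeds t3
    # (the third preset value) becomes a new central topic word and
    # diffuses once more over the remaining leftovers.
    leftovers = [w for w in candidates if w not in targets]
    for w in leftovers:
        if burst.get(w, 0) > t3:
            targets.add(w)
            targets |= {v for v in leftovers
                        if v != w and cooc.get(w, {}).get(v, 0) > t2}
    return targets

candidates = ["compensate", "miss", "active_bug"]
cooc = {"reclaim": {"compensate": 0.7, "miss": 0.1, "active_bug": 0.1},
        "miss": {"active_bug": 0.8}}
burst = {"miss": 0.9, "active_bug": 0.4}
print(sorted(select_target_words(candidates, "reclaim", cooc, burst,
                                 t2=0.5, t3=0.6)))
# ['active_bug', 'compensate', 'miss', 'reclaim']
```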
It can be appreciated that multiple heterogeneous structures may exist in the same target topic cluster, i.e. the topic discussed in one topic discussion group may change several times, so the heterogeneous-mechanism identification of target topic words may be performed multiple times. The number of executions may depend on the number of topic word candidates in the target topic cluster and on the time span: the more candidates and the longer the time span, the more executions.
And S2044, determining the text topic according to all the target topic words.
The target topic words determined in this embodiment are words with higher dispersion, higher word frequency and a certain burstiness in the network text data, and the text topic can be determined by integrating all of them. As a possible implementation, when all the target topic words belong to the same category, the text topic can be obtained by combining the relations between the target topic words, their grammatical relations and their generation times. For example, if the target topic words include "found", "bug" and "repaired", the text topic obtainable from the relations and grammar between them may be "the found bug was repaired" or "the repaired bug was found again"; if the generation time of "found" is earlier than that of "repaired", the text topic can be determined as "the found bug was repaired" based on the generation times of the target words. Of course, if the target topic words span multiple categories or fields, they may first be classified and a text topic generated for each category of target topic words. Target topic words spanning multiple categories or fields arise when the selected sources of web text data are scattered; thus, as a possible implementation, a relevant information sharing platform can be selected as the source of web text data.
After the text topic is obtained, the target sentence texts in which the text topic appears can be determined, all target sentence texts scored according to text similarity, and the top-N target sentence texts uploaded so that relevant staff can take corresponding countermeasures. Since text data discussing one topic mostly shares the same words, sentences with higher text similarity represent the majority view within the topic discussion group and are comparatively representative. Scoring all target sentence texts by text similarity specifically means calculating the similarity between each pair of target sentence texts; for example, among the target sentence texts ABBB, ABBD, ABCB and ACBB, ABBB is the representative sentence, having the highest similarity to the other sentence texts.
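Selecting the representative sentence can be sketched with a word-overlap similarity. The patent does not fix a similarity measure; Jaccard similarity and the sample sentences are assumptions:

```python
def representative_sentence(sentences):
    # Score each target sentence by its average Jaccard (word-overlap)
    # similarity to the others and return the highest scorer.
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb)

    def avg_sim(s):
        others = [t for t in sentences if t is not s]
        return sum(jaccard(s, t) for t in others) / len(others)

    return max(sentences, key=avg_sim)

sentences = ["server stop update compensation",
             "server stop update",
             "stop update compensation players",
             "new skin price"]
print(representative_sentence(sentences))  # server stop update compensation
```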
As a further complement to the corresponding embodiment of FIG. 5, the operations of computing the boundary degrees of freedom for ordered frequent items and culling ordered frequent items with lower boundary degrees of freedom are as follows:
After the ordered frequent items are selected, the boundary degree of freedom of each ordered frequent item is calculated, and importance scoring and filtering are performed on the ordered frequent items based on the boundary degree of freedom. The boundary degree of freedom is obtained from the entropy of the words on the two sides of the ordered frequent item:

R_k = -\sum_{i=1}^{m} P_{L,i|k} \log P_{L,i|k} - \sum_{j=1}^{n} P_{R,j|k} \log P_{R,j|k}

wherein, for ordered frequent item k, R_k is the boundary degree of freedom index, m is the total number of words appearing to the left of the ordered frequent item across the sentences of the network text data and n is the total number to the right, P_{L,i|k} represents the frequency with which the i-th word appears to the left of the given ordered frequent item k, and P_{R,j|k} the frequency with which the j-th word appears to its right. After the boundary degree of freedom of each ordered frequent item is calculated, ordered frequent items whose boundary degree of freedom is smaller than the sixth preset value can be removed from the topic word candidates. This filtering operation based on the boundary degree of freedom selects the important ordered frequent items.
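The two-sided entropy above can be sketched directly; for simplicity the item is matched as a contiguous token run, and the sample sentences are illustrative:

```python
import math
from collections import Counter

def boundary_freedom(sentences, item):
    # R_k: entropy of the word distribution immediately left of the
    # ordered frequent item plus the entropy immediately to its right.
    def entropy(counter):
        total = sum(counter.values())
        return -sum((n / total) * math.log(n / total)
                    for n in counter.values())

    left, right = Counter(), Counter()
    k = list(item)
    for sent in sentences:
        words = sent.split()
        for i in range(len(words) - len(k) + 1):
            if words[i:i + len(k)] == k:
                if i > 0:
                    left[words[i - 1]] += 1
                if i + len(k) < len(words):
                    right[words[i + len(k)]] += 1
    return entropy(left) + entropy(right)

# Varied neighbors => high boundary freedom; fixed neighbors => low:
varied = ["a system bug b", "c system bug d", "e system bug f"]
fixed = ["a system bug b"] * 3
print(boundary_freedom(varied, ("system", "bug")) >
      boundary_freedom(fixed, ("system", "bug")))  # True
```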
Referring to fig. 6, which is a schematic flow chart of another text topic detection method according to an embodiment of the present application, the method of the present embodiment may include:
s301, setting keywords in the web text data as topic word candidates.
In this embodiment, there may be an operation of performing word division on the web text data, and setting, as the keyword, a word having a frequency of occurrence greater than a fourth preset value.
S302, setting ordered frequent items in the web text data as topic word candidates.
Since web text data is varied in form, many meaningful expressions are composed of several words that do not necessarily appear consecutively; for example, the notion "activity bug" may appear in variant forms such as "a bug in the activity" or "the activity has a bug". In this embodiment, both keywords and ordered frequent items are used as topic word candidates, which improves the accuracy of topic detection.
S303, calculating the boundary degree of freedom of each ordered frequent item, and removing the ordered frequent items with the boundary degree of freedom smaller than a sixth preset value from topic word candidates.
S304, performing a clustering operation based on the main-post structure on the topic word candidates to obtain topic clusters.
In this step, the topic word candidates may be clustered based on a community partitioning algorithm, so that the topic word candidates in one resulting topic cluster share the same main-post label. Referring to fig. 7, a clustering schematic diagram of the Louvain community partitioning algorithm according to an embodiment of the present application is shown. The Louvain algorithm includes two phases; "1st pass" in FIG. 7 refers to the first phase and "2nd pass" to the second. In the first phase, the nodes of the network are traversed repeatedly, trying to move each node into the community that maximizes the modularity gain, until no node changes community. In the second phase, the result of the first phase is processed: each community is merged into a supernode to reconstruct the network, where an edge weight between two supernodes is the sum of the weights of the edges between their original nodes. The first and second phases are iterated until the algorithm stabilizes, yielding a community partitioning result, i.e. the clustering of the topic word candidates.
S305, calculating the burst level of each topic word candidate item according to the word frequency information and the volatility index of the topic word candidate item, and setting the topic word candidate item with the highest burst level in the topic cluster as a central topic word.
Wherein, after the topic word candidates are generated, a volatility index can be introduced to measure the burstiness of each candidate. It should be emphasized that when scoring burst topic words, not only the volatility index of each topic word candidate is considered; the occurrence frequency of each candidate can also be taken into account.
S306, calculating topic dispersion of each central topic word in all discussion group text data.
Wherein, after the clustering operation of S304, each topic cluster may correspond to one or more topic word candidates. This embodiment constructs an index reflecting the distribution of the central topic words across all network text data, namely the topic dispersion. Because widely discussed topics spread across groups, the topic dispersion index can be used to filter out topic-irrelevant information items. Since topics of different observation objects (such as a certain application software, a network game, a design drawing, typhoon weather or a spring-festival travel ticket) differ in discussion density, as a feasible implementation, the minimum value of topic dispersion, i.e. the first preset value, can be calculated by the formula Y = l_1 + (\log(g_{i,t}) - \log(N_1) - 0.1)^2; wherein Y is the first preset value, l_1 is a given minimum value, g_{i,t} is the topic dispersion of observation object i at time t, and N_1 is a given value.
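The adaptive threshold formula can be written out directly (the argument values below are illustrative; l_1 and N_1 are given per observation object):

```python
import math

def first_preset_value(l1, g_it, n1):
    # Y = l1 + (log(g_it) - log(N1) - 0.1)^2 : the adaptive lower bound
    # on topic dispersion from the embodiment.
    return l1 + (math.log(g_it) - math.log(n1) - 0.1) ** 2

# The penalty term vanishes when g_it = N1 * e^0.1, leaving Y = l1;
# dispersions far from the reference raise the threshold:
print(first_preset_value(0.3, 10 * math.e ** 0.1, 10))  # 0.3
```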
S307, setting the topic cluster where the central topic words with the topic dispersion degree larger than the first preset value are located as a target topic cluster.
S308, determining text sub-data corresponding to the target topic cluster in the network text data.
S309, setting the ratio of the co-occurrence sentences to the number of all sentences in the text sub-data as a first co-occurrence rate.
S310, setting the central topic word and the non-central topic word with the first co-occurrence rate larger than the second preset value as target topic words.
In order to further screen topic word candidates in the target topic cluster, the embodiment adopts a central diffusion mechanism, sorts and filters other topic words according to the co-occurrence rate with the central topic words, and selects a given number of target topic words.
S311, setting the non-central topic words with the first co-occurrence rate smaller than or equal to the second preset value and the burst level larger than the third preset value as new central topic words.
S312, calculating a second co-occurrence rate of the new central topic word and each non-new central topic word of each target topic cluster, setting the new central topic word and the non-new central topic word with the second co-occurrence rate larger than a second preset value as target topic words, and determining the text topic according to all the target topic words.
Further, after determining the text topic, in order to better present the details of the topic, the embodiment may further score and sort the related sentences, and then select a representative sentence for reporting.
The present embodiment uses an adaptive mechanism for different observation objects, including an adaptive mechanism in low-frequency word filtering, adaptive selection of candidates, and adaptive filtering of irrelevant information items. Based on the distribution characteristics of the network text data, clustering is performed with a community partitioning algorithm, and irrelevant topic items are filtered out according to the topic dispersion of the central topic words in each topic discussion group. The embodiment is a data-driven method: it needs neither model parameter estimation nor a predefined word library, can automatically generate different numbers of topics according to the specific situation of the data, and realizes unsupervised topic detection. In selecting the target topic words, the text topic detection method of this embodiment screens with both a central diffusion mechanism and a heterogeneous mechanism, giving higher accuracy and recall than other unsupervised methods. By monitoring the topic fluctuation of the network text data, text topics can be learned in a timely manner. The method has the advantage that topic clusters and topic words need not be defined in advance: topic words and topic clusters are discovered without supervision from the fluctuation patterns of the words users discuss, which overcomes the shortcomings of supervised methods.
In order to facilitate understanding of the solution according to the embodiments of the present application, the following description is provided in connection with a practical application scenario to which the solution according to the embodiments of the present application is applicable.
In this embodiment, the observation object for text topic detection is a mobile game, and in order to obtain effective user feedback comments, the game's official post bar (the "glory bar") and the WeChat game circle are taken as the sources of network text data.
In order to demonstrate the timeliness and reliability of text topic detection, the embodiment selects network text data from within the last 24 hours for topic detection. According to word occurrence frequency, the topic word candidates in the network text data are determined to be 'rank', 'team', 'upscale', 'skin' and 'Galois'. After the topic word candidates are determined, all of them are scored and ranked according to their fluctuation index and occurrence frequency; ordered from high to low burst level, the candidates are 'Galois', 'skin', 'rank', 'team' and 'upscale'. Further, the present application clusters the network text data by topic discussion group to obtain topic clusters A, B, C and D, wherein the topic word candidates in topic cluster A include 'Galois', 'skin' and 'rank', with 'Galois' having the highest burst level; the candidates in topic cluster B include 'Galois', 'upscale' and 'skin', with 'upscale' having the highest burst level; the candidates in topic cluster C include 'Galois' and 'rank', with 'Galois' having the highest burst level; and the candidates in topic cluster D include 'Galois', 'team' and 'rank', with 'Galois' having the highest burst level. 'Galois' may therefore serve as the central topic word of topic clusters A, C and D, and 'upscale' as the central topic word of topic cluster B. The topic dispersion of 'Galois' is calculated to be greater than the first preset value, while the topic dispersion of 'upscale' is less than the first preset value, so topic cluster B can be filtered out and topic clusters A, C and D are set as the target topic clusters.
In order to further screen out irrelevant information items in the target topic clusters, a central diffusion mechanism may be used to screen the topic word candidates in topic clusters A, C and D. For example, in the network text data corresponding to topic clusters A, C and D, among all sentences containing the central topic word 'Galois', 80% also contain the non-central topic word 'skin', 30% also contain the non-central topic word 'rank', and 20% also contain the non-central topic word 'upscale'; the co-occurrence rate of 'skin' with the central topic word is therefore far greater than that of the other non-central topic words. 'Galois' and 'skin' are taken as target topic words, and the text topic 'Galois skin' is finally determined from the combination of target topic words. If the publication times of the sentences containing 'Galois' and 'skin' are on the whole earlier than those containing 'rank', heterogeneity exists in the network text data; a new central topic word can then be selected from the target topic clusters and screened on the basis of the central diffusion mechanism, yielding another text topic, 'rank'. It follows that the user feedback comments about the mobile game mainly comprise discussions of 'Galois skin' and discussions of 'rank': users first discussed Galois's skin and then discussed ranking.
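The central-diffusion screening in this example can be sketched in a few lines of Python. The tokenized sentences, the word sets, and the 0.5 threshold standing in for the "second preset value" are illustrative assumptions, not values fixed by the application.

```python
def co_occurrence_rate(sentences, center, candidate):
    """First co-occurrence rate: share of sentences containing the
    central topic word that also contain the candidate word."""
    with_center = [s for s in sentences if center in s]
    if not with_center:
        return 0.0
    return sum(candidate in s for s in with_center) / len(with_center)

# Ten sentences from the target topic clusters, reduced to word sets
# (illustrative data matching the 80% / 30% / 20% figures above).
sentences = [
    {"Galois", "skin", "rank"}, {"Galois", "skin", "rank"},
    {"Galois", "skin"}, {"Galois", "skin"}, {"Galois", "skin"},
    {"Galois", "skin"}, {"Galois", "skin"}, {"Galois", "skin"},
    {"Galois", "rank", "upscale"}, {"Galois", "upscale"},
]

SECOND_PRESET = 0.5  # assumed threshold for the "second preset value"
targets = [w for w in ("skin", "rank", "upscale")
           if co_occurrence_rate(sentences, "Galois", w) > SECOND_PRESET]
# "skin" co-occurs in 80% of the "Galois" sentences, "rank" in 30%,
# "upscale" in 20%, so only "skin" joins "Galois" as a target topic word.
```

Under these assumptions the screening keeps exactly the pair that forms the text topic 'Galois skin'.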
On the other hand, the application also provides a text topic detection device. For example, referring to fig. 8, which shows a schematic diagram of the composition structure of a text topic detection device according to an embodiment of the present application, the device according to the present embodiment may be applied to the server according to the above embodiment, and the device includes:
a central topic word determining module 21, configured to determine a central topic word of the web text data;
a dispersion calculating module 22, configured to calculate topic dispersions of the central topic words in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups;
the topic determination module 23 is configured to determine a text topic according to all central topic words with topic dispersion larger than a first preset value.
After the central topic words of the network text data are determined, the topic dispersion of each central topic word is calculated. Topic dispersion represents how a central topic word is distributed over the network text data of the individual topic discussion groups: a high topic dispersion indicates that the central topic word is frequently discussed in most of the topic discussion groups, while a low topic dispersion indicates that it is discussed infrequently across all topic discussion groups or only within individual ones. Since the central topic words are the core words of the topics discussed in the network text data, screening them by topic dispersion and determining text topics only from central topic words whose topic dispersion exceeds the first preset value reduces misjudgments of text topics caused by individual users repeatedly posting the same specific content in a single community. The method can therefore filter out irrelevant topic items and improve the accuracy of text topic detection.
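The role of topic dispersion described above can be illustrated with a small sketch. The patent characterizes dispersion only qualitatively, so the "fraction of discussion groups mentioning the word" definition, the sample data, and the 0.5 threshold for the first preset value below are all assumptions.

```python
def topic_dispersion(word, group_texts):
    """Fraction of topic discussion groups whose sentences mention `word`.
    (Illustrative definition; the patent does not fix a formula.)"""
    hit = sum(1 for sentences in group_texts
              if any(word in s for s in sentences))
    return hit / len(group_texts)

# Network text data divided by topic discussion group, as word sets.
groups = [
    [{"Galois", "skin"}, {"Galois"}],  # discussion group 1
    [{"Galois", "rank"}],              # discussion group 2
    [{"upscale"}],                     # discussion group 3
    [{"Galois", "upscale"}],           # discussion group 4
]

FIRST_PRESET = 0.5  # assumed threshold for the "first preset value"
# "Galois" appears in 3 of 4 groups (dispersion 0.75 > 0.5): kept.
# "upscale" appears in 2 of 4 groups (dispersion 0.5, not > 0.5): filtered.
```

A word pushed heavily by a single user in one community scores low under this measure, which is exactly the misjudgment the dispersion filter is meant to suppress.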
Further, the central topic word determining module 21 includes:
the clustering unit is used for determining topic word candidates in the network text data and performing clustering operation on the topic word candidates to obtain topic clusters; wherein all topic word candidates in the same topic cluster correspond to the same topic discussion group;
and the candidate screening unit is used for setting topic word candidates meeting preset conditions in the topic cluster as the central topic word.
Further, the topic determination module 23 is specifically configured to set, as a target topic cluster, a topic cluster in which all central topic words with topic dispersion greater than the first preset value are located, and determine the text topic according to all the target topic clusters.
Further, the topic determination module 23 includes:
a topic cluster screening unit, configured to set the topic clusters in which all central topic words with topic dispersion greater than the first preset value are located as target topic clusters;
the co-occurrence rate calculation unit is used for calculating a first co-occurrence rate of the central topic word and each non-central topic word of each target topic cluster;
a first target topic word setting unit, configured to set the central topic word and the non-central topic word with the first co-occurrence rate greater than a second preset value as target topic words;
and the text topic determining unit is used for determining the text topic according to all the target topic words.
Further, the co-occurrence calculating unit includes:
a text sub-data determining sub-unit, configured to determine text sub-data corresponding to the target topic cluster in the network text data;
a co-occurrence rate setting subunit, configured to set, as the first co-occurrence rate, the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data; wherein a co-occurrence sentence is a sentence in the text sub-data that contains both the central topic word and the respective non-central topic word.
Further, the device further includes:
a new central topic word setting unit, configured to set, as a new central topic word, a non-central topic word whose first co-occurrence rate is less than or equal to the second preset value and whose burst level is greater than a third preset value;
and a second target topic word setting unit, configured to calculate a second co-occurrence rate between the new central topic word and each non-new-central topic word of each target topic cluster, and to set the new central topic word and the non-new-central topic words whose second co-occurrence rate is greater than the second preset value as target topic words.
Further, the clustering unit includes:
a keyword selection subunit, configured to set keywords in the web text data as the topic word candidates; wherein the occurrence frequency of the keywords is larger than a fourth preset value;
and/or an ordered frequent item selection subunit, configured to set ordered frequent items in the web text data as the topic word candidates; the ordered frequent items are a plurality of words with fixed sequence in a target sentence pattern, and the occurrence frequency of the target sentence pattern in the network text data is larger than a fifth preset value.
Further, the device further includes:
and the ordered frequent item screening subunit is used for calculating the boundary degree of freedom of each ordered frequent item and removing the ordered frequent item with the boundary degree of freedom smaller than a sixth preset value from the topic word candidate items.
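The boundary degree of freedom is defined by a formula that appears only as an image in the published text, so the exact form is not reproduced here. One common instantiation of such a left/right-context freedom measure is the branching entropy over the words adjacent to the ordered frequent item; the sketch below uses that form purely as an assumption, not as the patented formula.

```python
import math
from collections import Counter

def boundary_freedom(left_words, right_words):
    """Branching-entropy boundary measure for an ordered frequent item:
    low when one side is dominated by a single neighboring word (the item
    is probably a fragment of a longer fixed phrase and should be removed
    from the candidates once it falls below the sixth preset value)."""
    def entropy(words):
        total = len(words)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log(c / total)
                    for c in Counter(words).values())
    return min(entropy(left_words), entropy(right_words))

# An item always followed by the same word has zero right-side entropy,
# so its boundary degree of freedom is zero and it is screened out.
```

Using the minimum of the two sides means a candidate survives only when both of its boundaries are genuinely free.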
Further, the candidate screening unit includes:
the burst level calculating subunit is used for calculating the burst level of each topic word candidate item according to the word frequency information and the fluctuation index of the topic word candidate item;
and the central topic word setting subunit is used for setting topic word candidates with highest burst level in the topic cluster as the central topic words.
Further, the device further includes:
and the topic word candidate item screening subunit is used for removing topic word candidate items with burst grades lower than a preset grade in the topic cluster.
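The burst-level computation names its ingredients (word frequency information and a fluctuation index) without fixing a formula, so the product form below is an illustrative assumption.

```python
def burst_level(freq_current, freq_baseline):
    """Illustrative burst level: frequency in the current window weighted
    by relative growth over a baseline window. Steady words score near
    zero; surging words score high. (The combining formula is an
    assumption, not the one fixed by the patent.)"""
    fluctuation = (freq_current - freq_baseline) / max(freq_baseline, 1)
    return freq_current * max(fluctuation, 0.0)

# "Galois" surged from 10 to 50 mentions: burst level 50 * 4.0 = 200.0.
# "team" held steady at 30 mentions: fluctuation 0, burst level 0.0.
```

Within each topic cluster the candidate with the highest such score would be set as the central topic word, and candidates below the preset level would be removed.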
Further, the device further includes:
the text determining module is used for determining a target sentence text where the text topic is located;
and the uploading module is used for scoring all the target sentence texts according to text similarity and uploading the target sentence texts whose scores rank in the top N.
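The scoring-and-uploading step can be sketched as follows. The patent says only that sentences are scored "according to text similarity"; Jaccard similarity over tokens and mean-similarity scoring are illustrative stand-ins for that unspecified measure.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two sentences."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def top_n_sentences(sentences, n):
    """Rank each target sentence by its mean similarity to the other
    target sentences and return the n most representative ones."""
    scored = []
    for i, s in enumerate(sentences):
        others = [jaccard(s, t) for j, t in enumerate(sentences) if j != i]
        score = sum(others) / len(others) if others else 0.0
        scored.append((score, i, s))
    scored.sort(key=lambda x: (-x[0], x[1]))  # highest score first, stable
    return [s for _, _, s in scored[:n]]

# Tokenized target sentences for the topic "Galois skin" (illustrative).
sentences = [
    ["Galois", "skin", "buy"],
    ["Galois", "skin", "new"],
    ["rank", "drop"],
]
# The two "Galois skin" sentences are most similar to the rest and would
# be the top-2 representatives uploaded for the topic.
```

Sentences that echo the dominant discussion score highest, so the uploaded top-N sentences summarize the topic rather than its outliers.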
On the other hand, the present application further provides a server, referring to fig. 9, which shows a schematic structural diagram of a server according to an embodiment of the present application, where the server 2100 of the present embodiment may include: a processor 2101 and a memory 2102.
Optionally, the server may also include a communication interface 2103, an input unit 2104 and a display 2105 and a communication bus 2106.
The processor 2101, the memory 2102, the communication interface 2103, the input unit 2104, and the display 2105 communicate with one another via the communication bus 2106.
In the embodiment of the present application, the processor 2101 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
The processor may call a program stored in the memory 2102. Specifically, the processor may perform the operations performed on the server side in the foregoing embodiments of the text topic detection method.
The memory 2102 is used to store one or more programs, and the programs may include program code that includes computer operation instructions, and in this embodiment, at least the programs for implementing the following functions are stored in the memory:
determining a central topic word of the web text data;
calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups;
and determining the text topic according to the central topic words with the topic dispersion degree larger than a first preset value.
In one possible implementation, the memory 2102 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, and at least one application program required for functions (such as topic detection functions, etc.), and the like; the storage data area may store data created during use of the computer.
In addition, memory 2102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device.
The communication interface 2103 may be an interface of a communication module, such as an interface of a GSM module.
The server may also include the display 2105, the input unit 2104, and so on.
Of course, the structure of the server shown in fig. 9 does not limit the server in the embodiment of the present application, and the server may include more or fewer components than shown in fig. 9 or may combine some components in practical applications.
On the other hand, an embodiment of the present application further provides a storage medium storing a computer program which, when loaded and executed by a processor, implements the text topic detection method described in any one of the embodiments above.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focusing on its differences from the others; identical or similar parts among the embodiments may be referred to one another. The apparatus embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between the entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations shall also fall within the scope of the present invention.

Claims (18)

1. A text topic detection method, comprising:
determining topic word candidates in the network text data, and performing clustering operation on the topic word candidates to obtain topic clusters; wherein all topic word candidates in the same topic cluster correspond to the same topic discussion group;
setting topic word candidates meeting preset conditions in the topic cluster as central topic words of the network text data;
calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups; the topic dispersion represents the distribution condition of the central topic words in the network text data corresponding to each topic discussion group;
determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value;
the determining topic word candidates in the web text data comprises the following steps:
setting keywords in the web text data as topic word candidates; wherein the occurrence frequency of the keywords is larger than a fourth preset value;
and/or setting ordered frequent items in the web text data as the topic word candidates; the ordered frequent items are a plurality of words with a fixed order in a target sentence pattern, and the occurrence frequency of the target sentence pattern in the network text data is greater than a fifth preset value.
2. The method of claim 1, wherein determining a text topic from a target center topic word comprises:
setting the topic clusters in which all central topic words whose topic dispersion is greater than the first preset value are located as target topic clusters, and determining the text topic according to all the target topic clusters.
3. The method of claim 2, wherein determining the text topic from all of the target topic clusters comprises:
calculating a first co-occurrence rate of a central topic word and each non-central topic word of each target topic cluster;
setting the central topic word and the non-central topic word with the first co-occurrence rate larger than a second preset value as target topic words;
and determining the text topic according to all the target topic words.
4. The method of claim 3, wherein calculating a first co-occurrence of a center topic word of each of the target topic clusters with each of the non-center topic words comprises:
determining text sub-data corresponding to the target topic cluster in the network text data;
setting the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data as the first co-occurrence rate; wherein a co-occurrence sentence is a sentence in the text sub-data that contains both the central topic word and the respective non-central topic word.
5. The text topic detection method of claim 3, further comprising, prior to determining the text topic from all of the target topic words:
setting the non-central topic words with the first co-occurrence rate smaller than or equal to the second preset value and the burst level larger than a third preset value as new central topic words;
and calculating a second co-occurrence rate between the new central topic word and each non-new-central topic word of each target topic cluster, and setting the new central topic word and the non-new-central topic words whose second co-occurrence rate is greater than the second preset value as target topic words.
6. The method for detecting a topic of text according to claim 1, further comprising, before performing a clustering operation on the topic word candidates to obtain a topic cluster:
calculating the boundary degree of freedom of each ordered frequent item, and removing the ordered frequent items with the boundary degree of freedom smaller than a sixth preset value from the topic word candidate items;
The calculation formula of the boundary degree of freedom is as follows:
wherein, for an ordered frequent item k, R_k denotes its boundary degree of freedom; over the sentences of the network text data, m is the total number of words appearing to the left of the ordered frequent item and n is the total number of words appearing to the right; P_(L,i|k) denotes the frequency with which the i-th word appears to the left of the given ordered frequent item k, and P_(R,j|k) denotes the frequency with which the j-th word appears to the right.
7. The text topic detection method according to any one of claims 1 to 6, wherein setting topic word candidates meeting preset conditions in the topic cluster as a center topic word of the web text data includes:
calculating the burst level of each topic word candidate item according to the word frequency information and the fluctuation index of the topic word candidate item;
and setting the topic word candidate item with the highest burst level in the topic cluster as the central topic word.
8. The method of claim 2, further comprising, after determining the text topic from all of the target topic clusters:
determining a target sentence text in which the text topic is located;
and scoring all the target sentence texts according to the text similarity, and uploading the target sentence texts with the top N scores.
9. A text topic detection device, characterized by comprising:
the clustering unit is used for determining topic word candidates in the network text data and performing clustering operation on the topic word candidates to obtain topic clusters; wherein all topic word candidates in the same topic cluster correspond to the same topic discussion group;
the candidate item screening unit is used for setting topic word candidates meeting preset conditions in the topic cluster as central topic words of the network text data;
the dispersion calculating module is used for calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups; the topic dispersion represents the distribution condition of the central topic words in the network text data corresponding to each topic discussion group;
the topic determination module is used for determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value;
wherein the clustering unit includes:
a keyword selection subunit, configured to set keywords in the web text data as the topic word candidates; wherein the occurrence frequency of the keywords is larger than a fourth preset value;
and/or an ordered frequent item selection subunit, configured to set ordered frequent items in the web text data as the topic word candidates; the ordered frequent items are a plurality of words with a fixed order in a target sentence pattern, and the occurrence frequency of the target sentence pattern in the network text data is greater than a fifth preset value.
10. The text topic detection device of claim 9, wherein the topic determination module is specifically configured to:
setting the topic clusters in which all central topic words whose topic dispersion is greater than the first preset value are located as target topic clusters, and determining the text topic according to all the target topic clusters.
11. The text topic detection device of claim 10, wherein the topic determination module comprises:
the topic cluster screening unit is used for setting topic clusters where all the central topic words with topic dispersion larger than the first preset value are located as target topic clusters;
the co-occurrence rate calculation unit is used for calculating a first co-occurrence rate of the central topic word and each non-central topic word of each target topic cluster;
a first target topic word setting unit, configured to set the central topic word and the non-central topic word with the first co-occurrence rate greater than a second preset value as target topic words;
and the text topic determining unit is used for determining the text topic according to all the target topic words.
12. The apparatus according to claim 11, wherein the co-occurrence calculating unit includes:
a text sub-data determining sub-unit, configured to determine text sub-data corresponding to the target topic cluster in the network text data;
a co-occurrence rate setting subunit, configured to set, as the first co-occurrence rate, the ratio of the number of co-occurrence sentences to the number of all sentences in the text sub-data; wherein a co-occurrence sentence is a sentence in the text sub-data that contains both the central topic word and the respective non-central topic word.
13. The text topic detection device of claim 11, further comprising:
a new central topic word setting unit, configured to set, as a new central topic word, a non-central topic word whose first co-occurrence rate is less than or equal to the second preset value and whose burst level is greater than a third preset value;
and a second target topic word setting unit, configured to calculate a second co-occurrence rate between the new central topic word and each non-new-central topic word of each target topic cluster, and to set the new central topic word and the non-new-central topic words whose second co-occurrence rate is greater than the second preset value as target topic words.
14. The text topic detection device of claim 9, further comprising:
the ordered frequent item screening subunit is used for calculating the boundary degree of freedom of each ordered frequent item and removing the ordered frequent item with the boundary degree of freedom smaller than a sixth preset value from the topic word candidate items;
the calculation formula of the boundary degree of freedom is as follows:
wherein, for an ordered frequent item k, R_k denotes its boundary degree of freedom; over the sentences of the network text data, m is the total number of words appearing to the left of the ordered frequent item and n is the total number of words appearing to the right; P_(L,i|k) denotes the frequency with which the i-th word appears to the left of the given ordered frequent item k, and P_(R,j|k) denotes the frequency with which the j-th word appears to the right.
15. The apparatus according to any one of claims 9 to 14, wherein the candidate screening unit includes:
the burst level calculating subunit is used for calculating the burst level of each topic word candidate item according to the word frequency information and the fluctuation index of the topic word candidate item;
and the central topic word setting subunit is used for setting topic word candidates with highest burst level in the topic cluster as the central topic words.
16. The text topic detection device of claim 10, further comprising:
the text determining module is used for determining a target sentence text where the text topic is located;
and the uploading module is used for scoring all the target sentence texts according to text similarity and uploading the target sentence texts whose scores rank in the top N.
17. A server, comprising:
a processor and a memory;
wherein the processor is configured to execute a program stored in the memory;
the memory is used for storing a program, and the program is used for at least:
determining topic word candidates in the network text data, and performing clustering operation on the topic word candidates to obtain topic clusters; wherein all topic word candidates in the same topic cluster correspond to the same topic discussion group;
setting topic word candidates meeting preset conditions in the topic cluster as central topic words of the network text data;
calculating topic dispersion of the central topic word in all discussion group text data; the discussion group text data is obtained by dividing the network text data according to topic discussion groups; the topic dispersion represents the distribution condition of the central topic words in the network text data corresponding to each topic discussion group;
determining a text topic according to the target central topic word; the target central topic words are central topic words with the topic dispersion larger than a first preset value;
the determining topic word candidates in the web text data comprises the following steps:
setting keywords in the web text data as topic word candidates; wherein the occurrence frequency of the keywords is larger than a fourth preset value;
and/or setting ordered frequent items in the web text data as the topic word candidates; the ordered frequent items are a plurality of words with fixed sequence in a target sentence pattern, and the occurrence frequency of the target sentence pattern in the network text data is larger than a fifth preset value.
18. A storage medium having stored therein computer-executable instructions which, when loaded and executed by a processor, implement the steps of the text topic detection method according to any one of claims 1 to 8.
CN201910549752.6A 2019-06-24 2019-06-24 Text topic detection method, device, server and storage medium Active CN110245355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910549752.6A CN110245355B (en) 2019-06-24 2019-06-24 Text topic detection method, device, server and storage medium

Publications (2)

Publication Number Publication Date
CN110245355A CN110245355A (en) 2019-09-17
CN110245355B true CN110245355B (en) 2024-02-13

Family

ID=67889064

Country Status (1)

Country Link
CN (1) CN110245355B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111125484B (en) * 2019-12-17 2023-06-30 网易(杭州)网络有限公司 Topic discovery method, topic discovery system and electronic equipment
CN111324725B (en) * 2020-02-17 2023-05-16 昆明理工大学 Topic acquisition method, terminal and computer readable storage medium
CN111444337B (en) * 2020-02-27 2022-07-19 桂林电子科技大学 Topic tracking method based on improved KL divergence
US11803709B2 (en) 2021-09-23 2023-10-31 International Business Machines Corporation Computer-assisted topic guidance in document writing

Citations (9)

Publication number Priority date Publication date Assignee Title
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN106066866A (en) * 2016-05-26 2016-11-02 同方知网(北京)技术有限公司 A kind of automatic abstracting method of english literature key phrase and system
CN106156182A (en) * 2015-04-20 2016-11-23 富士通株式会社 The method and apparatus that microblog topic word is categorized into specific field
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
WO2017041372A1 (en) * 2015-09-07 2017-03-16 百度在线网络技术(北京)有限公司 Man-machine interaction method and system based on artificial intelligence
CN106610931A (en) * 2015-10-23 2017-05-03 北京国双科技有限公司 Extraction method and device for topic names
CN107203513A (en) * 2017-06-06 2017-09-26 中国人民解放军国防科学技术大学 Microblogging text data fine granularity topic evolution analysis method based on probabilistic model
CN109271493A (en) * 2018-11-26 2019-01-25 腾讯科技(深圳)有限公司 A kind of language text processing method, device and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN104063428A (en) * 2014-06-09 2014-09-24 国家计算机网络与信息安全管理中心 Method for detecting unexpected hot topics in Chinese microblogs
CN106156182A (en) * 2015-04-20 2016-11-23 富士通株式会社 The method and apparatus that microblog topic word is categorized into specific field
WO2017041372A1 (en) * 2015-09-07 2017-03-16 百度在线网络技术(北京)有限公司 Man-machine interaction method and system based on artificial intelligence
CN106610931A (en) * 2015-10-23 2017-05-03 北京国双科技有限公司 Extraction method and device for topic names
CN106066866A (en) * 2016-05-26 2016-11-02 同方知网(北京)技术有限公司 A kind of automatic abstracting method of english literature key phrase and system
CN106383877A (en) * 2016-09-12 2017-02-08 电子科技大学 On-line short text clustering and topic detection method of social media
CN107203513A (en) * 2017-06-06 2017-09-26 中国人民解放军国防科学技术大学 Fine-grained topic evolution analysis method for microblog text data based on a probabilistic model
CN109271493A (en) * 2018-11-26 2019-01-25 腾讯科技(深圳)有限公司 Language text processing method, device and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An Automatic Procedure for Vehicle Tracking with a Roadside LiDAR Sensor; Jianqing Wu; 2018 Daniel B. Fambro Student Paper Award Winner; full text *
Cross-topic detection in microblogs based on a hybrid model; 詹勇; 杨燕; 王红军; Journal of Frontiers of Computer Science and Technology (Issue 08); full text *
Topic word chain extraction for social media using semantically constrained, temporally associated LDA; 万红新; 彭云; Journal of Chinese Computer Systems (Issue 04); full text *
Research on text processing techniques for topic detection in online forums; 吴伊萍; Journal of Chifeng University (Natural Science Edition) (Issue 11); full text *

Also Published As

Publication number Publication date
CN110245355A (en) 2019-09-17

Similar Documents

Publication Publication Date Title
CN110245355B (en) Text topic detection method, device, server and storage medium
US11093568B2 (en) Systems and methods for content management
US9767166B2 (en) System and method for predicting user behaviors based on phrase connections
Oliveira et al. Can social media reveal the preferences of voters? A comparison between sentiment analysis and traditional opinion polls
US7685091B2 (en) System and method for online information analysis
US10354017B2 (en) Skill extraction system
US10614364B2 (en) Localized anomaly detection using contextual signals
CA2617954C (en) Method and system for extracting web data
US9946775B2 (en) System and methods thereof for detection of user demographic information
US9165254B2 (en) Method and system to predict the likelihood of topics
Nazir et al. Social media signal detection using tweets volume, hashtag, and sentiment analysis
US20150161633A1 (en) Trend identification and reporting
CN106599065A (en) Food safety online public opinion early warning system based on Storm distributed framework
Heredia et al. Exploring the effectiveness of Twitter at polling the United States 2016 presidential election
Doshi et al. Predicting movie prices through dynamic social network analysis
CN110717089A (en) User behavior analysis system and method based on weblog
EP4044097A1 (en) System and method for determining and managing environmental, social, and governance (esg) perception of entities and industries through use of survey and media data
Liu et al. Oasis: Online analytic system for incivility detection and sentiment classification
Zheng et al. Identifying Labor Market Competitors with Machine Learning Based on Maimai Platform
Unnikrishnan et al. A Literature Review of Sentiment Evolution
Adedoyin-Olowe An association rule dynamics and classification approach to event detection and tracking in Twitter.
Wang et al. Timeline summarization for event-related discussions on a chinese social media platform
Daouia et al. Understanding World Economy Dynamics Based on Indicators and Events
Drif et al. Tracking diffusion pattern based on Salient Tweets
CN116070024A (en) Article recommendation method and device based on new energy cloud and user behavior

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant