CN117195013A - Topic clustering method, topic clustering device, electronic equipment and readable storage medium - Google Patents
Topic clustering method, topic clustering device, electronic equipment and readable storage medium
- Publication number
- CN117195013A (CN202311077666.2A)
- Authority
- CN
- China
- Prior art keywords
- topic
- clustering
- category
- topics
- clustering result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the specification discloses a topic clustering method, a topic clustering device, electronic equipment and a readable storage medium, wherein the method comprises the following steps: acquiring a plurality of topics of a social application platform at the current time; taking a plurality of topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; comparing the topic category number in the second clustering result with a preset clustering category number threshold; under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, the radius coefficient of the first clustering algorithm is adjusted until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; and determining hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
Description
Technical Field
The present document relates to the field of computer technologies, and in particular, to a topic clustering method, a topic clustering device, an electronic device, and a readable storage medium.
Background
Many hot topics are generated on social application platforms every day. Different media outlets reporting the same event create similar but not identical topics, each hoping to gain a share of attention as the topic host. In order to avoid recommending duplicate content, the recommendation system of the social application platform needs to find these similar topics and quickly select a representative from among them.
Hot topics can break out at any time, and their popularity typically fades within 2 hours, sometimes within as little as 20 minutes. The clustering operation must therefore be completed in a short time; the faster, the better.
Disclosure of Invention
The embodiment of the application aims to provide a topic clustering method, a topic clustering device, electronic equipment and a readable storage medium, which are used for quickly clustering hot topics on a social application platform.
In order to solve the technical problems, the embodiment of the application is realized as follows:
In a first aspect, a topic clustering method is provided, including:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
In a second aspect, a topic clustering device is provided, including:
an acquisition unit, configured to acquire a plurality of topics of a social application platform at the current time;
a first clustering unit, configured to take the topics as a sample topic set and cluster the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
a second clustering unit, configured to cluster the sample topic set again through a second clustering algorithm, taking the first center point of each topic category as a center, to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
a comparison unit, configured to compare the topic category number in the second clustering result with a preset clustering category number threshold;
an adjustment unit, configured to adjust the radius coefficient of the first clustering algorithm under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and a processing unit, configured to determine hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
In a third aspect, an electronic device is provided, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
In a fourth aspect, a computer-readable storage medium is provided, storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
As can be seen from the technical solutions provided by the embodiments of the present specification, the embodiments of the present specification have at least one of the following technical effects:
acquiring a plurality of topics of a social application platform at the current time; taking a plurality of topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category; clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises a plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category; comparing the topic category number in the second clustering result with a preset clustering category number threshold; under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, the radius coefficient of the first clustering algorithm is adjusted until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class; and determining hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
Clustering a plurality of topics of a social application platform at the current time through a first clustering algorithm, and automatically selecting the quantity of center points of topic categories; further clustering is carried out on the basis of the first clustering result through a second clustering algorithm, so that more accurate topic classification and a center point contained in the topic classification can be obtained; optimizing a clustering result by adjusting a radius coefficient of the first clustering algorithm to ensure that the number of the clustering categories meets expectations; the method and the device can automatically acquire the trending topics on the social application platform and quickly perform cluster analysis on the trending topics, so that a more accurate hot event identification result is obtained.
Drawings
In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some of the embodiments described in the present description, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an implementation of a topic clustering method according to an embodiment of the present disclosure.
Fig. 2 is a schematic structural diagram of a topic clustering device according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the purposes, technical solutions and advantages of this document clearer, the technical solutions of this specification will be clearly and completely described below with reference to specific embodiments of this specification and the corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without inventive effort are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
A social application platform, such as a microblog platform, sees a large number of hot topics every day. Different media outlets reporting the same event create very similar but not identical topics, each hoping to gain a share of attention as the topic host. In order to avoid recommending duplicate content, the recommendation system of the social application platform needs to find these similar topics and quickly select a representative topic from among them.
A topic is a character string enclosed by two hash signs (#); for example, "#Netizen says a needle was found in the seat back while watching Disappearing She#" is a topic. "Vectorization" is the conversion of topics into vectors by a language model; the length of the vectors can be specified arbitrarily, e.g., 32, 64, 128 or 768. The language model may be, but is not limited to, BERT, word2vec, etc. Putting together the topic vectors of 30 days yields a "topic vector library". All topic vectors are clustered to obtain a plurality of topic categories; the topic corresponding to the center of a topic category is defined as its representative topic, and topics belonging to the same category are regarded as the same event.
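As a small illustration of the definition above, the following sketch pulls topics out of a post by looking for text enclosed by a pair of hash signs. The helper name `extract_topics` and the sample post are hypothetical and not part of the described method; the vectorization step by a language model is not shown.

```python
import re

# Hypothetical helper: a topic is the text enclosed by a pair of '#' signs.
TOPIC_PATTERN = re.compile(r"#([^#]+)#")


def extract_topics(post: str) -> list[str]:
    """Return every non-empty string enclosed by two '#' signs in a post."""
    return [m.strip() for m in TOPIC_PATTERN.findall(post) if m.strip()]


# Illustrative input only.
sample_post = "#topic one# some commentary #topic two#"
found = extract_topics(sample_post)
print(found)
```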
Taking a microblog platform as an example, clustering in this scenario has the following characteristics:
the number of categories is not determined in advance;
the number of categories is large: the total number of topics can reach 500,000, but no more than 1,000 topics make the hot search list per day and the corresponding events number no more than 1,000, so most events have only one topic;
clustering must be fast: hot topics can break out at any time, and their popularity usually lasts less than 2 hours and may fade within as little as 20 minutes. The clustering operation must be completed within 5 minutes; the faster, the better.
Although some related algorithms can meet the above requirements, they have disadvantages. For example, the number of topic categories must be specified manually and cannot be determined automatically, yet in a microblog topic clustering task it is impossible to predict manually how many categories of topics there will be each day. In addition, the clustering result depends heavily on the initial centers: if the initial centers are of poor quality, the clustering result is also poor.
In order to solve the problems of the related algorithms, the embodiment of the specification provides a topic clustering method, including: acquiring a plurality of topics of a social application platform at the current time; taking a plurality of topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category; clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises a plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category; comparing the topic category number in the second clustering result with a preset clustering category number threshold; under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, the radius coefficient of the first clustering algorithm is adjusted until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class; and determining hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
Clustering a plurality of topics of a social application platform at the current time through a first clustering algorithm, and automatically selecting the quantity of center points of topic categories; further clustering is carried out on the basis of the first clustering result through a second clustering algorithm, so that topic classification with finer granularity and the quantity of center points can be obtained; optimizing a clustering result by adjusting a radius coefficient of the first clustering algorithm to ensure that the number of the clustering categories meets expectations; the method and the device can automatically acquire the trending topics on the social application platform and quickly perform cluster analysis on the trending topics, so that a more accurate hot event identification result is obtained.
The topic clustering method provided in the embodiments of the present disclosure is performed by a computer device, for example, at least one of a server, a notebook computer, a desktop computer, a tablet computer, or an intelligent robot. Alternatively, the execution subject of the topic clustering method may be a client (such as a social application) itself capable of executing the method.
For convenience of description, the implementation of the topic clustering method is described below by taking, as the execution subject of the method, an electronic device capable of executing the topic clustering method; the electronic device may specifically be a server, a notebook computer, a desktop computer, a tablet computer, or an intelligent robot. It will be appreciated that taking an electronic device as the execution subject is merely an exemplary illustration and should not be construed as limiting the method.
Fig. 1 is a schematic implementation flow diagram of a topic clustering method according to an embodiment of the present disclosure, including:
S110, obtaining a plurality of topics of the social application platform at the current time.
Through an API (application programming interface) of the social application platform or a third-party data analysis tool, detailed data such as the real-time hot search list, topic labels, popular users and topics of the social application platform can be obtained, which facilitates the analysis and mining of current topics.
The obtained N topics are mapped into topic vectors, denoted $x_1, x_2, \ldots, x_N$, for subsequent topic clustering. The aim of topic clustering is to divide all topic vectors into a plurality of classes, so that similar sample topics are close to each other and dissimilar sample topics are far from each other.
S120, taking a plurality of topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result.
That is, the topic vectors are taken as the sample topic set, and the sample topic set is clustered by using the first clustering algorithm to obtain the first clustering result.
Illustratively, an initial center point is first selected from the sample topic set: the first topic $x_1$ in the sample topic set may be selected as the initial center point, denoted $c_1$, or a topic may be randomly selected from the sample topic set as the center point. A radius $\sigma_0$ is then specified. The radius $\sigma_0$ is determined from the radius coefficient $r$ and the mean and variance of the sample topic set; in one embodiment, the radius coefficient $r$ may be 0.2.
Each topic $x_i$, $i = 1, 2, \ldots, N$, is then checked in subscript order. If the distance between $x_i$ and the initial center point $c_1$ is less than or equal to the radius $\sigma_0$, topic $x_i$ is labeled as the first category, i.e., the first topic category, and $x_i$ is used to update the initial center point $c_1$, so that $c_1$ slides a suitable distance toward $x_i$. If the distance between $x_i$ and $c_1$ is greater than the radius, topic $x_i$ is taken as a new center point $c_2$ and is labeled as the second category, i.e., the second topic category.
It should be noted that the distance between $x_i$ and the initial center point $c_1$ may be, but is not limited to, the Euclidean distance, the cosine distance, or the like.
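The two distance measures named above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are chosen here for illustration, and the text does not say which measure an embodiment actually uses.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """One minus the cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


print(euclidean_distance((0.0, 0.0), (3.0, 4.0)))
print(cosine_distance((1.0, 0.0), (0.0, 1.0)))
```

Cosine distance ignores vector length, which can matter when the language model produces embeddings of varying norm; Euclidean distance does not.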
Assume that K center points $c_1, c_2, \ldots, c_K$ have been generated. The distances between topic $x_i$ and the center points $c_1, c_2, \ldots, c_K$ are calculated. If $x_i$ falls within the radius of at least one center point, the center point $c_k$ closest to $x_i$ is found, the sample topic $x_i$ is labeled as the $k$-th category, and $x_i$ is used to update $c_k$, so that $c_k$ slides a suitable distance toward $x_i$. If $x_i$ lies outside the radius of all center points, $x_i$ is taken as a new center point $c_{K+1}$ and is labeled as the $(K+1)$-th category. This continues until all topics $x_i$ have been traversed, yielding the first clustering result. The first clustering result includes K topic categories, the first center point $c_k$ of each topic category, and the first topic number contained in each topic category, $n_1, n_2, \ldots, n_K$, where $n_1$ is the first topic number of the first topic category and $n_K$ is the first topic number of the $K$-th topic category.
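The first clustering pass described above can be sketched as below, under stated assumptions. The text does not give the radius formula or the size of the "suitable distance", so two choices here are assumptions made only for illustration: `radius_from_coefficient` takes the radius as `r` times the square root of the mean squared distance to the mean, and the center update is a running mean of the topics assigned so far. Euclidean distance is used; all function names are hypothetical.

```python
import math


def radius_from_coefficient(points, r=0.2):
    """ASSUMPTION: radius = r * sqrt(mean squared distance to the mean vector)."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    variance = sum(math.dist(p, mean) ** 2 for p in points) / n
    return r * math.sqrt(variance)


def first_clustering(points, radius):
    """Single pass: join the nearest center within the radius, else open a new one."""
    centers, counts, labels = [], [], []
    for x in points:
        k = None
        if centers:
            k = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            if math.dist(x, centers[k]) > radius:
                k = None  # outside the radius of every existing center
        if k is None:
            centers.append(list(x))  # x becomes a new center point
            counts.append(1)
            labels.append(len(centers) - 1)
        else:
            counts[k] += 1
            step = 1.0 / counts[k]  # ASSUMPTION: running-mean slide toward x
            centers[k] = [c + step * (xi - c) for c, xi in zip(centers[k], x)]
            labels.append(k)
    return centers, counts, labels


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
centers, counts, labels = first_clustering(points, radius=1.0)
print(labels, counts)  # [0, 0, 1, 1, 0] [3, 2]
```

Note that the number of categories is never passed in: it falls out of the radius, which is the property the method relies on.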
In the microblog topic clustering task, it is impossible to predict how many categories of topics there will be each day, and with related algorithms the clustering result depends heavily on the initial centers: if the initial centers are of poor quality, so is the clustering result. When the topic set is clustered with the first clustering algorithm, the number of topic categories does not need to be specified manually and can be determined automatically, and similar topics can be classified into the same category, so that a representative topic can be selected from the similar topics better and more quickly.
S130, clustering the sample topic sets again by a second clustering algorithm by taking the first central point of each topic category as a center to obtain a second clustering result.
Sequentially determining the distance between each topic and the existing first center point by taking all the first center points as centers; classifying topics into topic categories of a first center point corresponding to the smallest distance in the distances, and obtaining a second center point by updating the position of the first center point through the topics; after traversing each topic, obtaining a plurality of topic categories, a second center point of each topic category and the number of topics contained in each topic category.
Illustratively, the topics $x_i$, the number K of first center points, the K topic categories of the first clustering result, the first center points $c_1, c_2, \ldots, c_K$ of the topic categories, and the first topic numbers $n_1, n_2, \ldots, n_K$ contained in the topic categories are taken as the input of the second clustering algorithm. The first center point sequence $c_k$ is retained, and the first topic numbers $n_1, n_2, \ldots, n_K$ contained in the K topic categories are cleared to zero, i.e., $n_k = 0$ for $k = 1, 2, \ldots, K$.
The distance from topic $x_i$ to each first center point $c_k$ is then calculated: $d_{ik} = \mathrm{dist}(x_i, c_k)$, $k = 1, 2, \ldots, K$. The first center point nearest to topic $x_i$ is selected, and the number of topics contained in that first center point is increased by 1. All topics $x_i$ are traversed in this way to obtain the second clustering result, which includes the topic categories, the number $K'$ of second center points, the second center points $c_1, c_2, \ldots, c_{K'}$ of the topic categories, and the second topic numbers $n_1, n_2, \ldots, n_{K'}$ contained in the topic categories.
Further clustering is performed on the basis of the first clustering result through a second clustering algorithm, so that more accurate topic classification and a center point contained in the topic classification can be obtained.
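A minimal sketch of the second pass follows: counts are reset to zero, every topic is assigned to its nearest first center point, and a second center point is then derived per category. The update rule is an assumption made for illustration (the mean of the assigned topics, keeping the first center point when a category receives no topic); the text only says the center point is updated by the topics. The function name and sample data are hypothetical.

```python
import math


def second_clustering(points, first_centers):
    """Reassign every topic to its nearest first center, then recompute centers."""
    dim = len(first_centers[0])
    counts = [0] * len(first_centers)            # topic numbers cleared to zero
    sums = [[0.0] * dim for _ in first_centers]
    labels = []
    for x in points:
        k = min(range(len(first_centers)),
                key=lambda j: math.dist(x, first_centers[j]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
        labels.append(k)
    # ASSUMPTION: second center = mean of assigned topics; unchanged if empty.
    second_centers = [
        [s / counts[k] for s in sums[k]] if counts[k] else list(first_centers[k])
        for k in range(len(first_centers))
    ]
    return second_centers, counts, labels


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]
first_centers = [[0.03, 0.03], [5.05, 5.0]]
second_centers, second_counts, second_labels = second_clustering(points, first_centers)
print(second_labels, second_counts)  # [0, 0, 1, 1, 0] [3, 2]
```

Because no new center is ever opened in this pass, the number of categories cannot grow; only the membership and the center positions change.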
It should be noted that the second clustering algorithm directly re-clusters the topic set with the first center points as centers, so the number of second center points equals the number of first center points, and the topic categories in the second clustering result are the same as those in the first clustering result; the difference is that the number of topics contained in each topic category may differ. Therefore, after the second clustering result is obtained, the first topic number and the second topic number of the topic category with the same category identifier also need to be compared. When the second topic number is less than or equal to the first topic number, the position of the center point corresponding to the category identifier is updated using the topics contained in the first center point, and is used as the position of the second center point; when the second topic number is greater than the first topic number, the position of the center point corresponding to the category identifier is updated using the topics contained in the second center point, and is used as the position of the second center point. The category identifier is the number of the topic category: number 1 denotes the first topic category, number 2 denotes the second topic category, and so on.
For example, taking the first topic category $c_1$, whose category identifier is 1: the second topic number contained in the corresponding second center point is 10000 and the first topic number contained in the first center point is 8000. Since the second topic number is greater than the first topic number, the position of the center point corresponding to category identifier 1 is updated using the topics contained in the second center point, and is used as the position of the second center point. Specifically, while the topics are traversed, the center point of topic category $c_1$ is moved a suitable distance toward each newly added topic; the distance is obtained from empirical data or in other ways, which are not described in detail here.
It should be noted that, when the topics are traversed by the second clustering algorithm, they are not traversed one by one. Instead, the vectorized computing capability of the CPU and the block-wise read/write capability of the memory are fully utilized: a plurality of sample topics are traversed at a time, which increases the amount of computation per step and yields the clustering result quickly.
S140, comparing the topic category number in the second clustering result with a preset clustering category number threshold.
After the second clustering algorithm reaches a stable state through several rounds of iteration, whether to increase or reduce the number of center points is determined by comparing the number of topic categories in the second clustering result with the preset cluster category number threshold. Each center point is the center of a cluster category, so the number of center points equals the number of cluster categories.
The cluster category number threshold includes a first threshold, a second threshold, and a category number threshold range.
S150, under the condition that the number of topic categories in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result.
If there are too many center points, that is, the number of center points is greater than the first threshold, the radius coefficient r is increased and the process restarts from the first clustering algorithm. The larger the radius coefficient r, the larger the radius of each cluster category, the more topics each cluster category contains, and the fewer the cluster categories and center points.
If there are too few center points, that is, the number of center points is smaller than the second threshold, the radius coefficient r is reduced and the process restarts from the first clustering algorithm. The smaller the radius coefficient r, the smaller the radius of each cluster category, the fewer topics each cluster category contains, and the more cluster categories and center points there are.
If the number of center points is not too large, if the number of center points is within a category number threshold, which is a sub-interval of the interval range of the second threshold and the first threshold, then a third algorithm is used to subtract some. And under the condition that the number of the center points is larger than the second threshold value and smaller than the first threshold value and is out of the range of the category number threshold value, sequentially deleting the topic categories with the least number of the contained topics until the number of the topic categories in the range of the category number threshold value is obtained. For example, topic categories with a topic number of 0 may be deleted first, and then topic categories with a topic number of 1 may be deleted until the center point number is within the category number threshold.
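The deletion step can be sketched as below, assuming the per-category topic counts are held in a dictionary keyed by category identifier. The function name `prune_smallest` and the parameter `max_categories` (standing for the upper end of the category number threshold range) are hypothetical.

```python
from typing import Dict


def prune_smallest(counts: Dict[int, int], max_categories: int) -> Dict[int, int]:
    """Delete the categories with the fewest topics until at most
    `max_categories` remain.

    Deleting the smallest categories one by one is equivalent to keeping
    the `max_categories` largest ones, which is what the sort does here.
    """
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:max_categories])
```

Categories with 0 topics are removed first, then categories with 1 topic, matching the order given in the example above. How ties between equally small categories are broken is not specified in the text; this sketch keeps whichever appears first in the input.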
The first threshold, the second threshold, and the category number threshold range are all determined based on the user's needs, or based on other possible manners, which are not limited by the present disclosure.
S160, determining a hot event in the current topic according to the topic number contained in each topic category in the target clustering result.
The number of topics contained in each topic category in the target clustering result is determined, the topic categories are sorted in descending order of the number of topics they contain, the specified number of top-ranked topic categories are selected as target topic categories, and the topic corresponding to the center point of each target topic category is taken as a hot event. For example, the topic categories ranked in the top 50 may be taken as the target topic categories.
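As a rough illustration of this ranking step, the sketch below assumes each topic category is keyed by the text of the topic at its center point. The function name `hot_events` and the dictionary layout are assumptions; the default of 50 follows the example above.

```python
from typing import Dict, List


def hot_events(counts: Dict[str, int], top_n: int = 50) -> List[str]:
    """Return the center-point topics of the `top_n` largest categories.

    `counts` maps the topic at each category's center point to the number
    of topics that category contains.
    """
    ranked = sorted(counts, key=counts.get, reverse=True)
    return ranked[:top_n]
```

For instance, `hot_events({"a": 3, "b": 10, "c": 7}, top_n=2)` yields the two categories with the most topics, largest first.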
In summary, according to the topic clustering method provided by the embodiment of the present disclosure, a plurality of topics of a social application platform at the current time are obtained; the plurality of topics are taken as a sample topic set, and the sample topic set is clustered using a first clustering algorithm to obtain a first clustering result, the first clustering result comprising a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category; the sample topic set is clustered again by a second clustering algorithm with all the first center points as centers to obtain a second clustering result, the second clustering result comprising the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category; the topic category number in the second clustering result is compared with a preset clustering category number threshold; when the topic category number in the second clustering result does not meet the clustering category number threshold, the radius coefficient of the first clustering algorithm is adjusted until a second clustering result that meets the clustering category number threshold is obtained as a target clustering result, the radius coefficient being used for adjusting the radius of the topic categories; and hot events in the topics are determined according to the number of topics contained in each topic category in the target clustering result.
By clustering the plurality of topics of the social application platform at the current time with the first clustering algorithm, the number of center points of the topic categories is selected automatically. Further clustering on the basis of the first clustering result with the second clustering algorithm yields a more accurate topic classification and the center point of each topic category. Adjusting the radius coefficient of the first clustering algorithm optimizes the clustering result so that the number of cluster categories meets expectations. Trending topics on the social application platform can thus be acquired automatically and clustered quickly, yielding a more accurate hot event identification result.
Fig. 2 is a schematic structural diagram of a topic clustering device 200 according to an embodiment of the present disclosure, including:
an acquiring unit 210 that acquires a plurality of topics of the social application platform at the current time;
the first clustering unit 220 uses the topics as a sample topic set, and clusters the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
the second clustering unit 230 re-clusters the sample topic sets by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
The comparison unit 240 compares the topic category number in the second clustering result with a preset clustering category number threshold;
an adjusting unit 250, configured to adjust, in a case where the number of topic categories in the second clustering result does not meet the clustering category number threshold, the radius coefficient of the first clustering algorithm until a second clustering result meeting the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
the processing unit 260 determines the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
Optionally, in an embodiment, the first clustering unit 220 is configured to:
determining the radius of each topic category after clustering according to the radius coefficient of the first clustering algorithm and the mean value and variance of the sample topic set;
selecting one topic from the sample topic set as an initial center point;
sequentially traversing to determine whether each topic is within the radius range of the existing center point;
classifying topics within the radius range as topic categories of the corresponding center points;
Taking topics outside the radius range as new center points;
after each topic in the sample topic set is traversed, the topic categories and the first center point of each topic category are obtained.
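The traversal performed by the first clustering unit can be sketched as follows. The radius is passed in directly, because how it is derived from the radius coefficient, mean and variance is not restated here. Using the first topic as the initial center point, Euclidean distance, and assigning a topic to the first center point whose radius covers it are assumptions of this sketch, as is the function name `first_pass`.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def first_pass(topics: Sequence[Vector],
               radius: float) -> Tuple[List[List[float]], List[int]]:
    """Radius-based single-pass clustering.

    Returns the first center points and the first topic number of each
    category. `topics` must contain at least one topic.
    """
    centers = [list(topics[0])]  # assumed choice of initial center point
    counts = [1]
    for topic in topics[1:]:
        for k, center in enumerate(centers):
            if math.dist(topic, center) <= radius:
                counts[k] += 1  # inside an existing radius: join that category
                break
        else:
            # Outside every existing radius: the topic becomes a new center.
            centers.append(list(topic))
            counts.append(1)
    return centers, counts
```

Because new center points appear whenever a topic falls outside every existing radius, the number of categories is not fixed in advance; it follows from the radius.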
Optionally, in an embodiment, the second clustering unit 230 is configured to:
sequentially determining the distance between each topic in the sample topic set and each existing first center point;
classifying the topics into topic categories of first center points corresponding to the minimum distance, and updating the positions of the first center points of the topic categories through the topics to obtain second center points of the topic categories;
after traversing each topic, obtaining the topic categories, the second center point of each topic category and the topic quantity contained in each topic category.
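A minimal sketch of this second pass is given below. It assumes Euclidean distance and models "updating the position of the first center point through the topic" as moving the center a fixed fraction `step` toward each assigned topic; both the update rule and the name `second_pass` are assumptions rather than the disclosed implementation.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def second_pass(topics: Sequence[Vector],
                first_centers: Sequence[Vector],
                step: float = 0.1) -> Tuple[List[List[float]], List[int]]:
    """Assign each topic to its nearest center and pull that center toward it.

    Returns the second center points and the second topic number of each
    category.
    """
    centers = [list(c) for c in first_centers]
    counts = [0] * len(centers)
    for topic in topics:
        # Index of the center point at minimum distance from this topic.
        k = min(range(len(centers)), key=lambda i: math.dist(topic, centers[i]))
        counts[k] += 1
        centers[k] = [c + step * (t - c) for c, t in zip(centers[k], topic)]
    return centers, counts
```

The number of categories stays the same as in the first pass; only the center positions and the per-category topic numbers change.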
Optionally, in an embodiment, the processing unit 260 is configured to:
comparing the first topic number and the second topic number corresponding to the same topic category;
updating the position of a center point corresponding to the topic category through topics contained in the first center point of the topic category as a second center point of the topic category when the second topic number is smaller than or equal to the first topic number;
And when the second topic number is larger than the first topic number, updating the position of the center point corresponding to the topic category through topics contained in the second center point of the topic category as the second center point of the topic category.
Optionally, in an embodiment, the processing unit 260 is configured to:
increasing the radius coefficient if the topic category number is greater than the first threshold; and reducing the radius coefficient when the topic category number is smaller than the second threshold value until the topic category number is larger than the second threshold value and smaller than the first threshold value.
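The adjustment loop can be sketched as below. The clustering itself is abstracted behind a hypothetical callable `count_categories(r)` that reruns both clustering passes for a radius coefficient `r` and returns the resulting number of topic categories. The multiplicative `factor`, the `max_rounds` guard, and the choice to accept counts that equal either threshold are assumptions, since the text does not specify them.

```python
from typing import Callable, Tuple


def tune_radius(count_categories: Callable[[float], int],
                r: float,
                second_threshold: int,
                first_threshold: int,
                factor: float = 1.5,
                max_rounds: int = 50) -> Tuple[float, int]:
    """Rescale the radius coefficient until the category count is acceptable."""
    for _ in range(max_rounds):
        n = count_categories(r)
        if n > first_threshold:
            r *= factor   # too many categories: enlarge the radius
        elif n < second_threshold:
            r /= factor   # too few categories: shrink the radius
        else:
            return r, n   # boundary values are accepted (assumption)
    raise RuntimeError("radius coefficient did not settle within max_rounds")
```

A fixed factor can oscillate around a narrow interval, which is why the sketch caps the number of rounds instead of looping indefinitely.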
Optionally, in an embodiment, the processing unit 260 is configured to:
and under the condition that the topic category number is larger than the second threshold and smaller than the first threshold, and is out of the category number threshold, sequentially deleting topic categories with the smallest topic number until the topic category number within the category number threshold is obtained.
Optionally, in an embodiment, the processing unit 260 is configured to:
sorting in a descending order according to the topic number contained in each topic category in the target clustering result;
Determining topic categories ordered by the specified quantity in front as target topic categories;
and taking the target topic category as the hot event.
The topic clustering device 200 can implement the method of the method embodiment of fig. 1, and specifically, the topic clustering method of the embodiment shown in fig. 1 may be referred to, and will not be described in detail.
Fig. 3 is a schematic structural view of an electronic device according to an embodiment of the present specification. Referring to fig. 3, at the hardware level, the electronic device includes a processor, and optionally an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, network interface, and memory may be interconnected by an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, or an EISA (Extended Industry Standard Architecture) bus, among others. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bidirectional arrow is shown in fig. 3, but this does not mean that there is only one bus or one type of bus.
The memory is used for storing programs. Specifically, a program may include program code, and the program code includes computer operating instructions. The memory may include an internal memory and a non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the topic clustering device at the logical level. The processor executes the programs stored in the memory and is specifically configured to perform the following operations:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
Comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
By clustering the plurality of topics of the social application platform at the current time with the first clustering algorithm, the number of center points of the topic categories is selected automatically. Further clustering on the basis of the first clustering result with the second clustering algorithm yields a more accurate topic classification and the center point of each topic category. Adjusting the radius coefficient of the first clustering algorithm optimizes the clustering result so that the number of cluster categories meets expectations. Trending topics on the social application platform can thus be acquired automatically and clustered quickly, yielding a more accurate hot event identification result.
The method executed by the topic clustering device disclosed in the embodiment of fig. 1 of the present application can be applied to a processor or implemented by the processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and, in combination with its hardware, performs the steps of the above method.
The electronic device may further execute the method of fig. 1 and implement the function of the topic clustering device in the embodiment shown in fig. 1, which is not described herein.
Of course, other implementations, such as a logic device or a combination of hardware and software, are not excluded from the electronic device of the present application, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or a logic device.
The embodiments of the present application also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a portable electronic device comprising a plurality of application programs, enable the portable electronic device to perform the method of the embodiment shown in fig. 1, and in particular to perform the following operations:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
Clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
In summary, the foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.
Claims (10)
1. A topic clustering method, the method comprising:
acquiring a plurality of topics of a social application platform at the current time;
Taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
2. The method of claim 1, wherein clustering the sample topic sets using a first clustering algorithm with the plurality of topics as the sample topic sets to obtain a first clustering result comprises:
determining the radius of each topic category after clustering according to the radius coefficient of the first clustering algorithm and the mean value and variance of the sample topic set;
selecting one topic from the sample topic set as an initial center point;
sequentially traversing to determine whether each topic is within the radius range of the existing center point;
classifying topics within the radius range as topic categories of the corresponding center points;
taking topics outside the radius range as new center points;
after each topic in the sample topic set is traversed, the topic categories and the first center point of each topic category are obtained.
3. The method of claim 2, wherein re-clustering the sample topic sets with a second clustering algorithm centered on the first center point of each topic category to obtain a second clustering result, comprising:
sequentially determining the distance between each topic in the sample topic set and each existing first center point;
Classifying the topics into topic categories of first center points corresponding to the minimum distance, and updating the positions of the first center points of the topic categories through the topics to obtain second center points of the topic categories;
after traversing each topic, obtaining the topic categories, the second center point of each topic category and the topic quantity contained in each topic category.
4. The method of claim 3, further comprising, after obtaining the second clustering result:
comparing the first topic number and the second topic number corresponding to the same topic category;
updating the position of a center point corresponding to the topic category through topics contained in the first center point of the topic category as a second center point of the topic category when the second topic number is smaller than or equal to the first topic number;
and when the second topic number is larger than the first topic number, updating the position of the center point corresponding to the topic category through topics contained in the second center point of the topic category as the second center point of the topic category.
5. The method of claim 4, wherein the cluster category number threshold comprises a first threshold and a second threshold, the first threshold being greater than the second threshold; and under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result, wherein the method comprises the following steps:
Increasing the radius coefficient if the topic category number is greater than the first threshold; and reducing the radius coefficient when the topic category number is smaller than the second threshold value until the topic category number is larger than the second threshold value and smaller than the first threshold value.
6. The method of claim 5, wherein the cluster category number threshold comprises a category number threshold range, the category number threshold range being a sub-interval of an interval range determined by the second threshold and the first threshold; and under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result accords with the clustering category number threshold is obtained, wherein the method comprises the following steps:
and under the condition that the topic category number is larger than the second threshold and smaller than the first threshold, and is out of the category number threshold, sequentially deleting topic categories with the smallest topic number until the topic category number within the category number threshold is obtained.
7. The method of claim 1, wherein determining a hotspot event in the plurality of topics based on the number of topics included in each topic category in the target cluster result comprises:
Sorting in a descending order according to the topic number contained in each topic category in the target clustering result;
determining topic categories ordered by the specified quantity in front as target topic categories;
and taking the target topic category as the hot event.
8. A topic clustering device, characterized by comprising:
the acquisition unit is used for acquiring a plurality of topics of the social application platform at the current time;
the first clustering unit is used for clustering the sample topic sets by using a first clustering algorithm to obtain a first clustering result by taking the topics as the sample topic sets; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
the second clustering unit is used for clustering the sample topic sets again by taking the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
the comparison unit is used for comparing the topic category number in the second clustering result with a preset clustering category number threshold;
The adjustment unit is used for adjusting the radius coefficient of the first clustering algorithm under the condition that the number of topic categories in the second clustering result does not accord with the clustering category number threshold value until the second clustering result which accords with the clustering category number threshold value is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and the processing unit is used for determining hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
9. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic sets again by using the first center point of each topic category as a center through a second clustering algorithm to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
Comparing the topic category number in the second clustering result with a preset clustering category number threshold;
under the condition that the topic category number in the second clustering result does not accord with the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until the second clustering result which accords with the clustering category number threshold is obtained as a target clustering result; the radius coefficient is used for representing the radius of the topic class;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
10. A computer-readable storage medium storing one or more programs that, when executed by an electronic device comprising a plurality of application programs, cause the electronic device to:
acquiring a plurality of topics of a social application platform at the current time;
taking the topics as a sample topic set, and clustering the sample topic set by using a first clustering algorithm to obtain a first clustering result; the first clustering result comprises a plurality of topic categories, a first center point of each topic category and a first topic number contained in each topic category;
clustering the sample topic set again with a second clustering algorithm, using the first center point of each topic category as a center, to obtain a second clustering result; the second clustering result comprises the plurality of topic categories, a second center point of each topic category and a second topic number contained in each topic category;
comparing the topic category number in the second clustering result with a preset clustering category number threshold;
when the topic category number in the second clustering result does not meet the clustering category number threshold, adjusting the radius coefficient of the first clustering algorithm until a second clustering result that meets the clustering category number threshold is obtained, and taking that result as the target clustering result; the radius coefficient represents the radius of a topic category;
and determining the hot events in the topics according to the number of topics contained in each topic category in the target clustering result.
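The claimed steps can be illustrated with a minimal Python sketch. It is not the patent's implementation, and several points are assumptions made only for illustration: topics are taken to be already embedded as numeric vectors; the first clustering algorithm is stood in for by a simple single-pass, radius-based method; the second is a k-means style refinement seeded with the first-pass centers; "meeting the threshold" is read as "at most `max_categories` categories"; and the names `first_pass`, `second_pass`, `cluster_topics`, `hot_categories`, `base_radius` and `step` are hypothetical.

```python
import math


def mean(group):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return tuple(sum(axis) / len(group) for axis in zip(*group))


def assign(points, centers):
    """Put every point into the group of its nearest center."""
    groups = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[nearest].append(p)
    return groups


def first_pass(points, radius):
    """Assumed first algorithm: a point joins the nearest category whose
    center lies within `radius`, otherwise it starts a new category."""
    centers, members = [], []
    for p in points:
        best, best_d = None, radius
        for i, c in enumerate(centers):
            d = math.dist(p, c)
            if d <= best_d:
                best, best_d = i, d
        if best is None:
            centers.append(tuple(p))
            members.append([p])
        else:
            members[best].append(p)
            centers[best] = mean(members[best])
    return centers, [len(m) for m in members]


def second_pass(points, centers, iterations=10):
    """Assumed second algorithm: k-means style refinement that starts
    from the first-pass centers. Returns (center, topic_count) pairs."""
    for _ in range(iterations):
        groups = assign(points, centers)
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    groups = assign(points, centers)
    # Assumption: categories that end up empty are dropped.
    return [(c, len(g)) for c, g in zip(centers, groups) if g]


def cluster_topics(points, base_radius, max_categories, step=1.5, max_rounds=20):
    """Enlarge the radius coefficient until the second clustering result
    has at most `max_categories` categories (assumed threshold reading)."""
    coeff = 1.0
    for _ in range(max_rounds):
        centers, _ = first_pass(points, coeff * base_radius)
        result = second_pass(points, centers)
        if len(result) <= max_categories:
            return coeff, result
        coeff *= step
    raise ValueError("no clustering met the category threshold")


def hot_categories(result, top_k=1):
    """Rank categories by topic count; the largest are treated as hot."""
    return sorted(result, key=lambda item: item[1], reverse=True)[:top_k]
```

Calling `cluster_topics(vectors, base_radius, max_categories)` returns the final radius coefficient together with the target clustering result, and `hot_categories` then picks the categories holding the most topics. Growing the coefficient multiplicatively is only one possible adjustment rule; the claims do not specify how the coefficient is adjusted.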
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311077666.2A CN117195013A (en) | 2023-08-24 | 2023-08-24 | Topic clustering method, topic clustering device, electronic equipment and readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311077666.2A CN117195013A (en) | 2023-08-24 | 2023-08-24 | Topic clustering method, topic clustering device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117195013A (en) | 2023-12-08 |
Family
ID=89002661
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311077666.2A (Pending), CN117195013A (en) | Topic clustering method, topic clustering device, electronic equipment and readable storage medium | 2023-08-24 | 2023-08-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117195013A (en) |
- 2023-08-24: CN application CN202311077666.2A, published as CN117195013A (en), status: active, Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10997460B2 (en) | User identity determining method, apparatus, and device | |
US9779356B2 (en) | Method of machine learning classes of search queries | |
CN108596410B (en) | Automatic wind control event processing method and device | |
CN113342905B (en) | Method and device for determining stop point | |
JP2014515514A (en) | Method and apparatus for providing suggested words | |
CN108335131B (en) | Method and device for estimating age bracket of user and electronic equipment | |
CN113010778A (en) | Knowledge graph recommendation method and system based on user historical interest | |
CN105574030A (en) | Information search method and device | |
CN109886300A (en) | A kind of user's clustering method, device and equipment | |
WO2020253369A1 (en) | Method and device for generating interest tag, computer equipment and storage medium | |
CN113672793A (en) | Information recall method and device, electronic equipment and storage medium | |
CN113064930A (en) | Cold and hot data identification method and device of data warehouse and electronic equipment | |
CN110334104B (en) | List updating method and device, electronic equipment and storage medium | |
CN113656575B (en) | Training data generation method and device, electronic equipment and readable medium | |
CN111310834A (en) | Data processing method and device, processor, electronic equipment and storage medium | |
CN112905885B (en) | Method, apparatus, device, medium and program product for recommending resources to user | |
CN107392220B (en) | Data stream clustering method and device | |
CN114490786A (en) | Data sorting method and device | |
CN109933691A (en) | Method, apparatus, equipment and storage medium for content retrieval | |
CN107368281B (en) | Data processing method and device | |
CN115129791A (en) | Data compression storage method, device and equipment | |
CN117113174A (en) | Model training method and device, storage medium and electronic equipment | |
CN114840762B (en) | Recommended content determining method and device and electronic equipment | |
CN117195013A (en) | Topic clustering method, topic clustering device, electronic equipment and readable storage medium | |
CN110866085A (en) | Data feedback method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||