CN118093860A - Multi-level scientific research topic mining method based on text embedded vector clustering - Google Patents


Info

Publication number
CN118093860A
Authority
CN
China
Prior art keywords
article
clusters
level
scientific research
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410329914.6A
Other languages
Chinese (zh)
Inventor
金源
田阳杰
张鹤
李沄沨
许若华
李宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cetc Digital Intelligence Technology Beijing Co ltd
Original Assignee
Cetc Digital Intelligence Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cetc Digital Intelligence Technology Beijing Co ltd filed Critical Cetc Digital Intelligence Technology Beijing Co ltd
Priority to CN202410329914.6A
Publication of CN118093860A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/322 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of topic mining and relates to a multi-level scientific research topic mining method based on text embedding vector clustering. The method clusters an article set to obtain keyword semantic clusters, constructs article distribution vectors based on those clusters, and clusters the distribution vectors to obtain article clusters, completing the construction of article clusters; it then constructs an article cluster index set tree based on the keyword semantic clusters and article clusters, builds a multi-level topic network based on the tree, and determines the scientific research topic relations. Multi-level topics can thus be built more effectively and efficiently, and the resulting topics present a richer network structure. The application of the Text Embedding Clustering (TEC) method to topic relation mining is thereby extended from a single-level tree form to a multi-level network form.

Description

Multi-level scientific research topic mining method based on text embedded vector clustering
Technical Field
The invention belongs to the technical field of topic mining, and particularly relates to a multi-level scientific research topic mining method based on text embedded vector clustering.
Background
Owing to the large-scale adoption of neural language models and semantic embedding techniques, various contextual Text Embedding Clustering (TEC) methods at different granularity levels (e.g., words, sentences, and articles) have achieved success in the field of topic mining. Studies show that the representative topic words under each topic mined by TEC methods exhibit good coherence, matching or even partially exceeding prior topic models based on statistical modeling or neural networks. Among these methods, Sia et al. (2020) globally embed each keyword in the vocabulary and cluster the embedding results; the embedding vectors are obtained from (pre-)trained natural language models such as Word2Vec and BERT. For each topic cluster obtained, that study uses TF-IDF-weighted reranking to extract representative topic words. Mimno et al. (2020) instead cluster the instances of each keyword present in the text collection together with their respective semantic embedding vectors. Grootendorst (2020), on the other hand, proposes clustering article embedding vectors computed from a pre-trained BERT model to obtain article clusters, and applies a TF-IDF weighting strategy to each cluster to obtain its topic words. Such methods lean toward exploring topic associations between articles rather than semantic associations between keywords.
Multi-level topic mining aims to automatically identify interpretable topics and their parent-child relationships and to express them systematically as a topology; such a structure helps humans better understand complex corpora. For this task, traditional Bayesian topic models such as LDA have been extended to automatically infer hierarchical topics conforming to a tree structure by exploiting keyword co-occurrence and inclusion in text. The most influential work in this line is hLDA, which can infer coarse-to-fine topics for each article. Beyond traditional Bayesian approaches, TEC methods are increasingly applied to multi-level topic mining. Work in this branch focuses on clustering keyword embedding vectors in a top-down recursive manner: the topics at each level derive from the clustering results of the previous level. Concretely, the sub-clusters are obtained by further clustering the article subset under some parent cluster in the upper level; their keywords must come only from the article subset under the parent cluster, and the embedding vectors must be trained only on the article sets in the respective child clusters after clustering. This ensures that each level's topics refine the parent topics of the previous level. The pioneering work in this branch, TaxoGen, generates topic clusters step by step in this recursive manner: for each generated cluster, local embedding vectors for the cluster's topic words are learned with Word2Vec from the article subset assigned to the cluster, and the next level of clustering is performed on these vectors. Subsequent studies inherit this recursive hierarchical clustering framework and per-cluster local embedding learning.
However, the main focus of these studies is how to supplement an existing topic hierarchy (usually conforming to a tree structure) using a given article set, rather than building a new and more complex topic hierarchy from scratch.
Current research on multi-level topic mining has concentrated mainly on top-down recursive construction of multi-level topic trees from word embedding vectors. However, these efforts do not address building more complex multi-level topic network structures, and they tend to detect associations between topics based on semantic relevance rather than scientific-topic or statistical relevance. As a result, they may generate topic relationships that are semantically plausible but do not actually exist in a given text collection. On the other hand, multi-level topic models based on traditional statistical modeling, such as hLDA, tend to build topic trees by capturing keyword co-occurrence counts under different topics, but such models lack scalability and flexibility when processing large-scale data.
Disclosure of Invention
The invention aims to provide a multi-level scientific research topic mining method based on text embedding vector clustering, which extends the application of the Text Embedding Clustering (TEC) method in topic relation mining from a single-level tree form to a multi-level network form, so as to solve the problems of the prior art identified in the background section.
In order to achieve the above purpose, the invention adopts the following technical scheme. A multi-level scientific research topic mining method based on text embedding vector clustering comprises the following steps: inputting an article set and clustering it to obtain keyword semantic clusters; constructing article distribution vectors based on the keyword semantic clusters and clustering them to obtain article clusters, thereby completing the construction of article clusters; constructing an article cluster index set tree based on the keyword semantic clusters and article clusters; and constructing a multi-level topic network based on the article cluster index set tree to determine the scientific research topic relations.
Preferably, clustering to obtain keyword semantic clusters includes: using a pre-trained language model to compute an embedding vector for each occurrence of each keyword in the article set; and clustering the embedding vectors to obtain a certain number of keyword semantic clusters.
Preferably, constructing the article distribution vector includes: calculating the cosine similarity between the article embedding vector and the center vector of each keyword semantic cluster; and normalizing the similarity scores to obtain a probability distribution vector.
Preferably, clustering to obtain article clusters includes: applying the X-Means clustering algorithm to the sparse distribution vectors of the articles to obtain different article clusters.
Preferably, building the article cluster index set tree includes: for each set to be inserted, traversing all nodes one level below the root node of the existing tree; matching the target set against the index set corresponding to each node; and, by continuously inserting target index sets into the tree, obtaining an index set tree that records the scientific research topic relations.
Preferably, matching the target set against the index set corresponding to a node includes: if the target set contains the index set of some node, the path under that node is a candidate path; if no node satisfies the inclusion condition, the target set is inserted under the root node as a new first-level node; for each candidate path, all nodes on the path are traversed from top to bottom and the similarity between the currently traversed node and the target set is computed; if the similarity falls below a threshold, the traversal stops and the target index set is inserted as a child of the last traversed node.
Preferably, the similarity between the currently traversed node and the target set is calculated by the following formula:
where v_d denotes the pre-trained semantic vector of article d, D_s denotes the article cluster s to which d belongs, P(d|s) denotes the probability of generating d under article cluster s, P(d|k) denotes the probability of generating d from keyword semantic cluster k, and P(k|s) is the probability of generating keyword semantic cluster k from article cluster s.
Preferably, the threshold is selected as follows: a lower threshold is used in the early stages of the traversal, and a higher threshold thereafter.
Preferably, constructing the multi-level topic network based on the article cluster index set tree includes: mapping first-level nodes to different general topics, while second-level and lower nodes represent different scientific research / interdisciplinary topics.
Preferably, determining the scientific research topic relations includes: traversing the article cluster index sets from large to small and finding all parent nodes of each index set in the tree; taking the union of all parent-node index sets and computing the difference between the target index set and that union; the difference set then contains only index numbers corresponding to general topics, i.e., those general topics are also parents of the target index set.
The technical effects and advantages of the invention: compared with the prior art, the multi-level scientific research topic mining method based on text embedding vector clustering has the following advantages:
The method inputs an article set, clusters it to obtain keyword semantic clusters, constructs article distribution vectors based on those clusters, and clusters the vectors to obtain article clusters, completing the construction of article clusters; it further constructs an article cluster index set tree based on the keyword semantic clusters and article clusters, builds a multi-level topic network based on the tree, and determines the scientific research topic relations. Multi-level topics can thus be built more effectively and efficiently, and the resulting topics present a richer network structure.
Drawings
FIG. 1 is a flow chart of a multi-level scientific research topic mining method of the present invention;
FIG. 2 is a flow chart of the construction of an article cluster of the present invention;
FIG. 3 is a construction diagram of an article cluster index set tree of the present invention;
FIG. 4 is a flow chart of the invention for determining subject relationships of scientific research.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. The embodiments described are only some, not all, embodiments of the invention; the specific embodiments described herein are merely illustrative and are not intended to limit the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the invention.
The embodiment of the invention provides a multi-level scientific research topic mining method based on text embedding vector clustering. Article clusters are a reasonable representation of interdisciplinary scientific research topics; each article cluster is further represented by a series of keyword semantic clusters, and the semantic clusters are regarded as general scientific research topics.
The topic keywords within the same keyword semantic cluster have similar semantics, and their semantic similarity is strong enough to distinguish the cluster from other semantic clusters. For example, "neural network" is a general scientific topic containing semantically similar keywords such as "nerve", "neuron", "convolutional neural network", and "recurrent neural network".
An article cluster consists of articles describing similar scientific content, and must contain at least two keyword semantic clusters to reflect its interdisciplinary nature. For example, the article cluster "neural-network-based image segmentation techniques" is an interdisciplinary topic generated by combining the general topics "neural network" and "image processing". In the present invention, article clusters are represented as distribution vectors over a limited number of keyword semantic clusters. If an article cluster contains only one keyword semantic cluster with probability 1, it should be regarded as a general scientific research topic. Moreover, if two article clusters cover the same keyword semantic clusters but their probability distributions over those clusters differ significantly, they correspond to two different interdisciplinary topics or directions under the same fusion of general topics.
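The cluster-typing rule described above (a distribution concentrated on one keyword semantic cluster marks a general topic; support on several clusters marks an interdisciplinary topic) can be sketched as follows. The cluster names and probability values are invented for illustration and are not from the patent:

```python
import numpy as np

# Hypothetical article-cluster distribution vectors over 4 keyword
# semantic clusters; names and values are illustrative only.
clusters = {
    "neural networks":                np.array([1.0, 0.0, 0.0, 0.0]),
    "NN-based image segmentation":    np.array([0.6, 0.4, 0.0, 0.0]),
    "NN-accelerated image pipelines": np.array([0.3, 0.7, 0.0, 0.0]),
}

def topic_kind(dist, eps=1e-9):
    """General if exactly one semantic cluster carries nonzero probability."""
    support = np.flatnonzero(dist > eps)
    return "general" if len(support) == 1 else "interdisciplinary"

for name, dist in clusters.items():
    print(name, "->", topic_kind(dist))
```

Note that the last two clusters share the same support {0, 1} but differ in distribution, so under the rule above they count as two distinct interdisciplinary topics under the same general-topic fusion.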
Specifically, in this embodiment, a multi-level scientific research topic mining method based on text embedding vector clustering is provided as shown in fig. 1-4, including:
step 1: clustering the article set by inputting the article set to obtain a keyword semantic cluster; then, constructing article distribution vectors based on the keyword semantic clusters and clustering to obtain article clusters, so as to complete the construction of the article clusters;
Specifically, clustering to obtain keyword semantic clusters includes: calculating by using a pre-training language model to obtain an embedded vector of each keyword appearing in the article set each time; and clustering the embedded vectors to obtain a certain number of keyword semantic clusters.
Further, each keyword semantic cluster represents a sufficiently broad and independent general subject of scientific research. Subject keywords under such subjects tend to exhibit similarity in semantics or context, due in large part to their consistency in concept and naming patterns. The method comprises the steps of firstly, calculating and obtaining an embedded vector of each keyword in an article set by using a pre-training language model. After that, if the number of occurrences of a keyword is greater than a predefined threshold, then all embedded vectors of the keyword need to be first X-Means clustered. The purpose of this is to find ambiguous keywords and to get their embedded vectors under different semantics. After obtaining the embedded vectors of all keywords, clustering the embedded vectors (the clustering algorithm and the clustering quantity can be predefined) to obtain a certain quantity of keyword semantic clusters.
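A minimal runnable sketch of this stage, using random vectors as stand-ins for the per-occurrence embeddings a pretrained language model (e.g. BERT) would produce, and a plain Lloyd's k-means in place of the predefined clustering algorithm; keywords, anchors, and cluster count are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Lloyd's algorithm with farthest-point initialization; a stand-in
    for whatever predefined clustering algorithm is actually used."""
    centers = [X[0]]
    for _ in range(k - 1):                         # farthest-point init
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):                         # Lloyd's updates
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels

# Stand-ins for per-occurrence keyword embeddings: occurrences of each
# keyword are sampled near a keyword-specific anchor so the sketch runs
# without any model download.
rng = np.random.default_rng(1)
anchors = {"neuron": 0.0, "convolution": 0.5, "segmentation": 5.0, "pixel": 5.5}
occ = [(kw, rng.normal(a, 0.05, size=8)) for kw, a in anchors.items()
       for _ in range(5)]

labels = kmeans(np.stack([v for _, v in occ]), k=2)
semantic_clusters = {}
for (kw, _), lab in zip(occ, labels):
    semantic_clusters.setdefault(int(lab), set()).add(kw)
print(semantic_clusters)
```

With the anchors chosen here, the two semantic clusters recover {neuron, convolution} and {segmentation, pixel}, mirroring how semantically close keywords group into one general topic.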
Specifically, constructing the article distribution vector includes: calculating the cosine similarity between the article embedding vector and the center vector of each keyword semantic cluster; and normalizing the similarity scores to obtain a probability distribution vector.
Further, the core of this stage is to establish associations between each article and the different keyword semantic clusters. To better measure the correlation between articles and keyword semantic clusters, the framework considers two strategies. One is to count, for each keyword occurring in an article, the occurrences of the cluster it belongs to, and then normalize the counts into a probability distribution vector. The other is to compute the cosine similarity between the article embedding vector and the center vector of each keyword semantic cluster, and normalize the similarity scores into a probability distribution vector. The invention ultimately adopts the second strategy because it exhibits higher robustness to texts of different lengths.
The normalization step in both strategies is critical for finding distinct and well-separated document clusters. Typical normalization choices include L2 normalization and the Softmax function, but these tend to produce dense document distribution vectors that are harder to partition into well-separated clusters. In addition, the resulting dense centroid vector of each document cluster tends to contain trivial, noisy elements that do not reflect the true interdisciplinary nature of the research topic, which further hinders determining parent-child relationships in the MLTD stage of our framework.
To solve this problem, the invention employs the Sparsemax technique to normalize the distribution vectors. Sparsemax eliminates most of the noisy values in the original vector and produces a sparse vector that retains only the important distribution values. Such vectors generally cluster better, making article clusters easier to discover.
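The adopted strategy (cosine similarity to each semantic-cluster center, then Sparsemax) can be sketched as below. The Sparsemax implementation follows the standard formulation of Martins and Astudillo (2016), which the patent names but does not spell out; the centers and article vector are toy data:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax: Euclidean projection of z onto the probability simplex.
    Unlike softmax, it sets small entries exactly to zero."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    ks = np.arange(1, len(z) + 1)
    k = ks[1 + ks * z_sorted > cssv][-1]     # size of the support
    tau = (cssv[k - 1] - 1) / k              # threshold
    return np.maximum(z - tau, 0.0)

def article_distribution(article_vec, cluster_centers):
    """Cosine similarity of the article embedding to each keyword-semantic-
    cluster center, sparsified into a probability distribution."""
    a = article_vec / np.linalg.norm(article_vec)
    C = cluster_centers / np.linalg.norm(cluster_centers, axis=1, keepdims=True)
    return sparsemax(C @ a)

centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # toy centers
article = np.array([0.9, 0.4])                              # toy embedding
p = article_distribution(article, centers)
print(p)  # the negative-similarity cluster gets probability exactly 0
```

Softmax applied to the same similarities would assign every cluster a nonzero probability; Sparsemax zeroes the dissimilar cluster, which is exactly the sparsity the article-clustering step relies on.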
Specifically, clustering to obtain article clusters includes: applying the X-Means clustering algorithm to the sparse distribution vectors of the articles to obtain different article clusters.
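X-Means extends k-means by choosing the number of clusters automatically with a Bayesian Information Criterion. The sketch below is a simplification of that idea, an assumption rather than the patent's implementation: it sweeps k, scores each k-means fit with a BIC-style criterion, and keeps the best, whereas real X-Means splits centroids locally and tests each split. The data are toy sparse distribution vectors:

```python
import numpy as np

def kmeans(X, k, iters=20):
    # farthest-point initialization + Lloyd's updates
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

def bic_like(X, labels, centers, k):
    # simplified BIC-style score: fit term + complexity penalty (lower = better)
    n, d = X.shape
    sse = ((X - centers[labels]) ** 2).sum()
    return n * np.log(sse / n + 1e-12) + k * d * np.log(n)

def xmeans_like(X, k_max=4):
    best = None
    for k in range(1, k_max + 1):
        labels, centers = kmeans(X, k)
        score = bic_like(X, labels, centers, k)
        if best is None or score < best[0]:
            best = (score, k, labels)
    return best[1], best[2]

# Toy sparse article distribution vectors concentrated on two different
# keyword-semantic-cluster patterns.
rng = np.random.default_rng(0)
a = np.abs(rng.normal([0.7, 0.3, 0.0], 0.02, size=(10, 3)))
b = np.abs(rng.normal([0.0, 0.4, 0.6], 0.02, size=(10, 3)))
X = np.vstack([a, b])

k, labels = xmeans_like(X)
print("chosen number of article clusters:", k)
```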
Step 2: constructing an article cluster index set tree based on the keyword semantic clusters and the article clusters;
Constructing an article cluster index set tree, which comprises the following steps:
For each set to be inserted, traverse all nodes one level below the root node of the existing tree. The basic principle of building the tree over the index sets is to insert them into the existing tree from small to large; specifically, for each set to be inserted, the algorithm first traverses all nodes one level below the root node (i.e., the first-level nodes).
The target set is then matched against the index set corresponding to each node. If the target set contains the index set of a node, the path under that node is a candidate path, i.e., the path may contain the parent topic of the target set / article cluster / topic. If no node satisfies the inclusion condition, the target set is inserted under the root node as a new first-level node. Then, for each candidate path, the algorithm traverses all nodes on the path from top to bottom and computes the similarity between the currently traversed node and the target set.
If the similarity falls below a threshold, the traversal stops and the target index set is inserted as a child of the last traversed node.
To calculate the similarity between the current node and the target set, the algorithm uses the following formula:
where v_d denotes the pre-trained semantic vector of article d, D_s denotes the article cluster s to which d belongs, P(d|s) denotes the probability of generating d under article cluster s, P(d|k) denotes the probability of generating d from keyword semantic cluster k, and P(k|s) is the probability of generating keyword semantic cluster k from article cluster s.
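The formula itself appears only as an image in the patent publication and does not survive text extraction. A plausible reconstruction consistent with the surrounding variable definitions (an assumption, not the patent's verbatim equation) marginalizes the generation of article $d$ over the keyword semantic clusters:

```latex
P(d \mid s) \;=\; \sum_{k} P(d \mid k)\, P(k \mid s)
```

The node-target similarity would then presumably aggregate these per-article probabilities over the articles in $D_s$.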
For the similarity threshold, the algorithm does not use a fixed value; instead it uses a lower threshold in the early stages of the traversal (i.e., the first and second levels) and a higher threshold thereafter. This ensures that the similarity measure between general topics and interdisciplinary topics does not suffer from erroneous decisions caused by an initial threshold that is too high.
In addition, since one target set can be inserted into multiple candidate paths, the algorithm links successively inserted target-set nodes, and each target-set node to its parent nodes, with a connected chain of pointers, so that all target sets and their parent nodes can be found efficiently when the topic network is built later.
Furthermore, a target set may be identical to a node already present in the tree. This means that at least two article clusters share the same set of keyword semantic clusters while their distribution vectors over those clusters differ significantly. In a specific embodiment, this represents two interdisciplinary topics generated from the same general topics but with different focuses. For example, the interdisciplinary topics "Topic-aware NLM" and "NTM with pre-trained NLM" are both generated under the general topics "Neural Topic Model (NTM)" and "Neural Language Model (NLM)", yet have quite different research directions.
Finally, by continuously inserting target index sets into the tree, the algorithm obtains an index set tree that records the scientific research topic relations.
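The insertion procedure above can be sketched as follows. For brevity the sketch keeps only the subset-inclusion test and omits the similarity threshold and the pointer chains, so a target set is simply duplicated under each candidate path; class and function names are our own:

```python
# Minimal index-set-tree insertion: sets are inserted from small to large,
# candidate paths are first-level nodes whose index set is contained in the
# target, and descent stops when no child's set is a proper subset.

class Node:
    def __init__(self, idx_set):
        self.idx_set = frozenset(idx_set)
        self.children = []

def insert(root, target):
    target = frozenset(target)
    candidates = [c for c in root.children if c.idx_set <= target]
    if not candidates:                        # no node satisfies inclusion:
        root.children.append(Node(target))    # new first-level node
        return
    for path_head in candidates:              # target may join several paths
        node = path_head
        while True:                           # walk the path top-down
            nxt = next((c for c in node.children if c.idx_set < target), None)
            if nxt is None:
                node.children.append(Node(target))
                break
            node = nxt

def build_tree(index_sets):
    root = Node(set())
    for s in sorted(index_sets, key=len):     # insert from small to large
        insert(root, s)
    return root

root = build_tree([{1, 4, 5}, {2, 4}, {1, 2, 4, 5, 8, 9}])
print([sorted(c.idx_set) for c in root.children])  # [[2, 4], [1, 4, 5]]
```

Here {2, 4} and {1, 4, 5} become first-level nodes, and {1, 2, 4, 5, 8, 9} is inserted under both, since both paths satisfy the inclusion condition.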
Step 3: and constructing a multi-level topic network based on the article cluster index set tree, and determining the scientific research topic relation.
Specifically, after the index set tree is built, the algorithm begins to build a multi-level topic network based on it.
In the network, first-level nodes correspond to different general topics, while second-level and lower nodes represent different article clusters (i.e., scientific research / interdisciplinary topics).
Edges in the network represent parent-child relationships between topics, extracted from the index set tree. To build the network, the algorithm traverses each index set in the tree in descending order of size.
For the current index set, the algorithm finds its location and corresponding parent nodes in the tree via its pointer chain. It then takes the union of the index sets of all parent nodes and computes the difference between the target index set and that union. The resulting difference set has the property that it contains no existing index set.
In a specific embodiment, the difference set contains only index numbers corresponding to general topics, i.e., those general topics are also parents of the target index set. For example, suppose a target index set is {1,2,4,5,8,9} and it has two parent index sets {1,4,5} and {2,4}; the set remaining after the difference computation is then {8,9}. This set is not an existing node in the tree, because if it were, the target index set would already have been inserted under it. Thus, based on the provided text set, only the two general scientific topics {8} and {9} can become the remaining parents of the target index set. In this example, the parent topics of the target index set {1,2,4,5,8,9} are therefore {1,4,5}, {2,4}, {8} and {9}.
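The parent-derivation rule just walked through can be sketched on the same numbers; the function and variable names are our own:

```python
def derive_parents(target, tree_parents):
    """Union all in-tree parent index sets, then treat each index of the
    target not covered by that union as a singleton general-topic parent."""
    target = frozenset(target)
    union = frozenset().union(*tree_parents)
    leftovers = sorted(target - union)        # indices covered by no parent
    return [frozenset(p) for p in tree_parents] + \
           [frozenset({i}) for i in leftovers]

parents = derive_parents({1, 2, 4, 5, 8, 9}, [{1, 4, 5}, {2, 4}])
print([sorted(p) for p in parents])  # [[1, 4, 5], [2, 4], [8], [9]]
```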
Unlike current single-level topic mining based on text semantic vector clustering and tree-structured multi-level topic mining, the scheme of the invention realizes multi-level, network-structured topic mining based on semantic vectors and distribution vectors. The method is the first to represent more general academic topics with semantic clusters and more refined interdisciplinary fusion topics with article clusters. By modeling the associations between and within the two types of topics through vector sparsification and set-similarity matching, it solves for the first time the problem of determining their parent-child subordination relations, thereby automatically constructing a multi-level topic network.
Unlike the complexity and inefficiency of traditional statistical topic modeling and neural-network topic modeling, the scheme of the invention offers high generality, flexibility, and efficiency. It supports topic mining at any text granularity and has very low technical and application barriers.
In summary, an article set is input and clustered to obtain keyword semantic clusters; article distribution vectors are constructed based on the keyword semantic clusters and clustered to obtain article clusters, completing the construction of article clusters; an article cluster index set tree is constructed based on the keyword semantic clusters and article clusters; a multi-level topic network is constructed based on the tree, and the scientific research topic relations are determined. Multi-level topics can thus be built more effectively and efficiently, and the resulting topics present a richer network structure.
This embodiment provides, for the first time, a brand-new multi-level topic mining method based on text embedding vector clustering. The method first divides the keywords of a text set into a number of semantic clusters via embedding clustering. Each text in the set is then represented as a (probability) distribution vector over these keyword semantic clusters. Further clustering of these distribution vectors yields a number of text clusters exhibiting different academic topics. Finally, the method systematically identifies relationships between the text clusters using vector sparsification and subset matching to create a multi-level topic network. Compared with existing multi-level topic models based on statistical methods, such as hLDA, the method constructs multi-level topics more effectively and efficiently, and the constructed topics exhibit a richer network structure. In addition, the application of the Text Embedding Clustering (TEC) method to topic relation mining is extended for the first time from a single-level tree form to a multi-level network form. Qualitative and quantitative analysis shows that the proposed method outperforms prior methods in topic coherence and topic diversity.
Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention. Although the invention has been described in detail with reference to these embodiments, those skilled in the art will appreciate that the described embodiments may be modified or their elements replaced by equivalents; any modification, equivalent substitution, or improvement that does not depart from the spirit and principles of the present invention falls within its scope of protection.

Claims (10)

1. A multi-level scientific research topic mining method based on text embedding vector clustering, characterized by comprising the following steps:
inputting an article set and clustering it to obtain keyword semantic clusters; constructing article distribution vectors based on the keyword semantic clusters and clustering those vectors to obtain article clusters, thereby completing the construction of the article clusters;
constructing an article cluster index set tree based on the keyword semantic clusters and the article clusters;
and constructing a multi-level topic network based on the article cluster index set tree, and determining the scientific research topic relationships.
2. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 1, wherein the clustering to obtain keyword semantic clusters comprises the following steps:
computing, using a pre-trained language model, an embedding vector for each occurrence of each keyword in the article set;
and clustering these embedding vectors to obtain a number of keyword semantic clusters.
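As a minimal sketch of this claim, the snippet below clusters keyword vectors with K-Means. The two synthetic "semantic neighbourhoods" of random vectors are stand-ins for the pre-trained language-model embeddings the claim actually requires.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in embeddings: two well-separated groups of keyword occurrences.
keyword_vecs = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 8)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 8)),
])

def cluster_keywords(vectors, n_clusters):
    """Group keyword embedding vectors into semantic clusters; returns one
    cluster label per keyword occurrence plus the cluster centre vectors."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    return km.labels_, km.cluster_centers_

labels, centers = cluster_keywords(keyword_vecs, n_clusters=2)
```

Note the claim embeds every *occurrence* of a keyword, so the same keyword can land in different semantic clusters depending on context.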
3. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 2, wherein the constructing article distribution vectors comprises the following steps:
calculating the cosine similarity between the embedding vector of each article and the center vector of each keyword semantic cluster;
and normalizing the similarity scores to obtain a probability distribution vector.
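A sketch of this step in plain NumPy. The softmax used here is an assumption; the claim only requires that the cosine scores be normalized into a probability distribution.

```python
import numpy as np

def article_distribution(article_vec, cluster_centers):
    """Cosine similarity between an article embedding and each keyword-cluster
    centre, normalized with a softmax into a probability distribution."""
    a = article_vec / np.linalg.norm(article_vec)
    c = cluster_centers / np.linalg.norm(cluster_centers, axis=1, keepdims=True)
    sims = c @ a                       # one cosine score per semantic cluster
    exp = np.exp(sims - sims.max())    # numerically stable softmax
    return exp / exp.sum()

# Toy example: three cluster centres in 2-D, one article embedding.
centers = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
p = article_distribution(np.array([2.0, 0.1]), centers)
```

The article vector points almost exactly at the first centre, so the first component of `p` dominates.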
4. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 3, wherein the clustering to obtain article clusters comprises: applying the X-Means clustering algorithm to the sparsified distribution vectors of the articles to obtain distinct article clusters.
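The two operations in this claim can be sketched as follows. The sparsification keeps only the largest entries of each distribution vector; the `xmeans_like` helper is a stand-in, not real X-Means — it scores several values of k by silhouette instead of splitting centroids under a BIC test, purely to keep the demo self-contained.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def sparsify(dist_vecs, keep=2):
    """Zero all but the `keep` largest entries of each distribution vector."""
    out = np.zeros_like(dist_vecs)
    idx = np.argsort(dist_vecs, axis=1)[:, -keep:]
    rows = np.arange(dist_vecs.shape[0])[:, None]
    out[rows, idx] = dist_vecs[rows, idx]
    return out

def xmeans_like(vectors, k_max=5):
    """Stand-in for X-Means: try k = 2..k_max, keep the best silhouette.
    (True X-Means recursively splits centroids under a BIC criterion.)"""
    best_score, best_labels = -1.0, None
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_score, best_labels = score, labels
    return best_labels

rng = np.random.default_rng(2)
# Two groups of articles concentrated on different semantic clusters.
group_a = np.abs(rng.normal([0.8, 0.1, 0.05, 0.05], 0.02, size=(10, 4)))
group_b = np.abs(rng.normal([0.05, 0.05, 0.1, 0.8], 0.02, size=(10, 4)))
labels = xmeans_like(sparsify(np.vstack([group_a, group_b])))
```

Sparsifying first removes near-zero noise components, which sharpens the cluster boundaries the second-stage algorithm has to find.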
5. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 1, wherein the constructing an article cluster index set tree comprises the following steps:
for each set to be inserted, traversing all nodes one level below the root node of the existing tree;
matching the target set against the index set corresponding to each node;
and obtaining, by successively inserting target index sets into the tree, an index set tree that records the scientific research topic relationships.
6. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 5, wherein the matching the target set against the index set corresponding to the node comprises the following steps:
if the target set contains the index set of a node, taking the path under that node as a candidate path; if no node satisfies the containment condition, inserting the target set under the root node as a new first-level node;
for each candidate path, traversing all nodes on the path from top to bottom, and computing the similarity between the currently traversed node and the target set;
and if the similarity falls below a threshold, stopping the traversal and inserting the target index set as a child of the last traversed node.
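The insertion procedure of claims 5 and 6 can be sketched as a simple tree walk. Jaccard similarity is an assumption here — the patent defines its own similarity measure in claim 7 — and the threshold value is illustrative.

```python
class Node:
    def __init__(self, index_set):
        self.index_set = frozenset(index_set)
        self.children = []

def jaccard(a, b):
    """Jaccard similarity of two index sets (illustrative stand-in measure)."""
    return len(a & b) / len(a | b)

def insert(root, target, threshold=0.2):
    """Descend while some child's index set is contained in the target and
    similar enough; attach the target as a child of the last matching node,
    or as a new first-level node when no child satisfies containment."""
    target = frozenset(target)
    node = root
    while True:
        next_node = None
        for child in node.children:
            if child.index_set <= target and jaccard(child.index_set, target) >= threshold:
                next_node = child
                break
        if next_node is None:
            node.children.append(Node(target))
            return
        node = next_node

root = Node(set())
insert(root, {1, 2, 3})      # becomes a first-level node
insert(root, {1, 2, 3, 4})   # contained superset: descends under {1,2,3}
insert(root, {9})            # unrelated: new first-level node
```

Inserting larger sets before smaller ones (as claim 10 suggests) keeps broad topics near the root and specializations below them.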
7. The multi-level scientific research topic mining method based on text embedded vector clustering according to claim 6, wherein the similarity between the currently traversed node and the target set is calculated by a formula in which d denotes the pre-trained semantic vector of article d, d ∈ s denotes that article d belongs to article cluster s, P(d|s) denotes the probability of generating d under article cluster s, P(d|k) denotes the probability of generating d from keyword semantic cluster k, and P(k|s) denotes the probability of generating keyword semantic cluster k from article cluster s.
8. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 7, wherein the threshold is selected as follows: a lower threshold is used in the early stages of the traversal, and a higher threshold is used thereafter.
9. The multi-level scientific research topic mining method based on text embedded vector clustering according to claim 6, wherein the constructing a multi-level topic network based on the article cluster index set tree comprises: mapping the first-level nodes to different general topics, while nodes at the second level and below represent different scientific research topics or interdisciplinary topics.
10. The multi-level scientific research topic mining method based on text embedding vector clustering according to claim 9, wherein the determining the scientific research topic relationships comprises the following steps:
traversing the article cluster index sets from largest to smallest, and finding all parent nodes of each index set in the tree;
and taking the union of all parent-node index sets and computing the difference between the target index set and that union, wherein the indices in the difference set correspond to general topics, i.e., the parent nodes of the target index set include those general topics.
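The union-and-difference step of claim 10 is short enough to sketch directly. The function name and the flat list of parent sets are illustrative conveniences, not part of the claimed method.

```python
def general_topic_links(target_set, parent_sets):
    """Union the index sets of all parent nodes found in the tree, then take
    the difference (target minus union): the leftover indices correspond to
    general topics that should additionally be linked as parents of the
    target cluster."""
    union = set().union(*parent_sets) if parent_sets else set()
    return set(target_set) - union

# Parents {1,2} and {2,3} cover indices 1-3; indices 4 and 5 remain,
# so the target cluster also links to the general topics 4 and 5.
leftover = general_topic_links({1, 2, 3, 4, 5}, [{1, 2}, {2, 3}])
```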
CN202410329914.6A 2024-03-21 2024-03-21 Multi-level scientific research topic mining method based on text embedded vector clustering Pending CN118093860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410329914.6A CN118093860A (en) 2024-03-21 2024-03-21 Multi-level scientific research topic mining method based on text embedded vector clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410329914.6A CN118093860A (en) 2024-03-21 2024-03-21 Multi-level scientific research topic mining method based on text embedded vector clustering

Publications (1)

Publication Number Publication Date
CN118093860A true CN118093860A (en) 2024-05-28

Family

ID=91158954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410329914.6A Pending CN118093860A (en) 2024-03-21 2024-03-21 Multi-level scientific research topic mining method based on text embedded vector clustering

Country Status (1)

Country Link
CN (1) CN118093860A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118312619A (en) * 2024-06-05 2024-07-09 暨南大学 Sudden hot event detection method based on unsupervised feature clustering
CN118312619B (en) * 2024-06-05 2024-09-06 暨南大学 Sudden hot event detection method based on unsupervised feature clustering

Similar Documents

Publication Publication Date Title
CN107748757B (en) Question-answering method based on knowledge graph
Zhang et al. Taxogen: Unsupervised topic taxonomy construction by adaptive term embedding and clustering
CN107688870B (en) Text stream input-based hierarchical factor visualization analysis method and device for deep neural network
WO2017193685A1 (en) Method and device for data processing in social network
CN110264372B (en) Topic community discovery method based on node representation
CN111222318B (en) Trigger word recognition method based on double-channel bidirectional LSTM-CRF network
CN110633365A (en) Word vector-based hierarchical multi-label text classification method and system
CN118093860A (en) Multi-level scientific research topic mining method based on text embedded vector clustering
CN112417170B (en) Relationship linking method for incomplete knowledge graph
CN114428850B (en) Text retrieval matching method and system
CN117149974A (en) Knowledge graph question-answering method for sub-graph retrieval optimization
CN112036178A (en) Distribution network entity related semantic search method
CN117094291B (en) Automatic news generation system based on intelligent writing
Wang et al. Many hands make light work: Transferring knowledge from auxiliary tasks for video-text retrieval
CN110765781A (en) Man-machine collaborative construction method for domain term semantic knowledge base
CN117973519A (en) Knowledge graph-based data processing method
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN116108191A (en) Deep learning model recommendation method based on knowledge graph
Najafabadi et al. Tag recommendation model using feature learning via word embedding
Ding et al. A knowledge-enriched and span-based network for joint entity and relation extraction
Brank et al. Automatic evaluation of ontologies
CN116720519B (en) Seedling medicine named entity identification method
CN117235216A (en) Knowledge reasoning method based on heterogeneous knowledge fusion
Bao et al. HTRM: a hybrid neural network algorithm based on tag-aware
Zhang et al. Combining the attention network and semantic representation for Chinese verb metaphor identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination