CN117113982A - Big data topic analysis method based on embedding model - Google Patents

Big data topic analysis method based on embedding model

Info

Publication number
CN117113982A
CN117113982A (application CN202310337528.7A)
Authority
CN
China
Prior art keywords
topic
class
chinese
sentence
target
Prior art date
Legal status
Pending
Application number
CN202310337528.7A
Other languages
Chinese (zh)
Inventor
刘向阳
周康桥
Current Assignee
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date
Filing date
Publication date
Application filed by Hohai University (HHU)
Priority to CN202310337528.7A
Publication of CN117113982A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/258 - Heading extraction; Automatic titling; Numbering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification
    • G06F16/353 - Clustering; Classification into predefined classes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 - Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 - Services
    • G06Q50/26 - Government or public services
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Tourism & Hospitality (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Educational Administration (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a big data topic analysis method based on an embedding model. First, a Sentence-BERT model is used to produce a sentence embedding representation of each preprocessed Chinese text; the sentence embedding vectors are then reduced in dimension with the UMAP projection dimension reduction algorithm; the reduced vectors are clustered with the HDBSCAN clustering algorithm; and, based on the assignment of each Chinese text in the target corpus to its topic class, the Chinese words with the highest c-TF-IDF scores are selected to represent each topic class. Finally, the DSG model is used to produce word embedding representations of the topic words, the similarity between different topic words and between different topic classes is calculated, and fluctuation detection is performed for newly appearing topic classes. The overall design achieves high topic coherence and topic diversity, and can promptly and accurately detect new hotspot appeal topics and give early warning.

Description

Big data topic analysis method based on embedding model
Technical Field
The invention relates to a big data topic analysis method based on an embedding model, and belongs to the technical field of big data topic detection and analysis.
Background
At present, urban governance in China is in transition from a single-control model to a co-governance, co-management model, and citizen participation in urban governance has become a key element in modernizing the urban governance system and governance capability. Government service hotlines are an important channel for interaction between the government and the public, and many localities have opened such hotlines. Citizens can call the hotline to report livelihood-related appeals and problems and track how they are handled, which helps resolve citizens' problems and improves their satisfaction.
At the micro level, the government service hotline mainly deals with basic livelihood problems, absorbing citizen appeals and solving the social problems citizens care about. At the macro level, the hotline platform serves as a bridge for effective communication between government departments and citizens, and is significant for innovating social governance and promoting the construction of a service-oriented government. As the hotline system has become widespread, daily work orders in many places reach tens of thousands, and the number of work orders handled per year can reach millions. Especially when facing a sudden epidemic, the requirements on the hotline for handling livelihood problems are higher, involving efficiency, speed, and effect. Mining social hotspots and difficult problems and extracting their topics helps the hotline improve its precision of service and gradually approach the service goal of handling issues before complaints are filed.
After the appeals reported by citizens are effectively handled and resolved, the resolved work orders are reviewed. For analysts, the massive body of appeals can be organized and analyzed: problems are classified, hotspot problems receive dedicated analysis, and sensitive hot words are analyzed in depth and displayed visually with the help of mapping and text-mining clustering functions. Capturing the topic directions of the work orders reflects the livelihood problems of general public concern, enables intelligent early warning on complaint work orders routed through different channels, and provides better reference and guidance for decision makers.
The content of government hotline appeals consists of subjective descriptions of problems by a large number of citizens; the texts vary in length but carry relatively rich information. How to mine the latent semantic structure in these texts is a major research difficulty in natural language processing and text retrieval. The topic model is a popular and efficient method for mining semantic structure in text through high-order word co-occurrence patterns across documents. Topic-model techniques have been applied to many research areas with good results. However, although LDA topic modeling has been widely applied to natural language processing tasks, its performance is often unsatisfactory when modeling topics of short texts such as government hotline records, and after topic extraction there is no complete and efficient systematic method for detecting topic volatility.
Disclosure of Invention
The technical problem the invention aims to solve is to provide a big data topic analysis method based on an embedding model which, with a brand-new model structure and control strategy, can efficiently and accurately perform topic class analysis of a target corpus and fluctuation detection of hotspot topic classes.
The invention adopts the following technical scheme to solve the technical problem: the invention designs a big data topic analysis method based on an embedding model, which realizes topic class analysis of a target Chinese corpus according to the following steps A to E;
Step A. Perform a data preprocessing operation on each target Chinese text in the target Chinese corpus to obtain the Chinese word segments corresponding to each target Chinese text, and then proceed to step B;
Step B. Using a Sentence-BERT model fine-tuned from a pre-trained BERT model, perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text, and then proceed to step C;
Step C. According to preset neighborhood size information and a preset low-dimensional target space, apply the Uniform Manifold Approximation and Projection (UMAP) dimension reduction algorithm to the high-dimensional sentence embedding vectors of the target Chinese texts, obtaining the low-dimensional sentence embedding vector corresponding to each target Chinese text, and then proceed to step D;
Step D. According to a preset minimum cluster size, apply the hierarchical density-based HDBSCAN clustering algorithm to cluster the low-dimensional sentence embedding vectors corresponding to the target Chinese texts; the resulting clusters serve as the topic classes corresponding to the target corpus. Then proceed to step E;
Step E. Based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target corpus, obtain the Chinese word segments of the target Chinese texts included in each topic class; then apply the c-TF-IDF algorithm to obtain the c-TF-IDF score corresponding to each Chinese word segment in each topic class, and take the preset number of Chinese word segments with the highest c-TF-IDF scores in each topic class as the topic words of that topic class, thereby realizing topic class analysis of the target corpus.
As a preferred technical scheme of the invention: the method further comprises a step F, which is entered after step E is executed;
Step F. Perform word embedding representation of the topic words based on the DSG model, obtaining the high-dimensional vectors corresponding to the topic words in each topic class of the target corpus; then, combining the historical topic classes in the historical topic class set and their topic words, detect through similarity calculation which topic classes of the target corpus are new compared with the historical topic class set.
As a preferred technical scheme of the invention: in step A, first remove from the target Chinese corpus any target Chinese text not expressed in text form; then, for each target Chinese text, perform fine-grained Chinese word segmentation to obtain the Chinese word segments corresponding to that text, remove the Chinese stop words among them, and update the Chinese word segments corresponding to that text; the Chinese word segments corresponding to each target Chinese text are thus obtained.
As a preferred technical scheme of the invention: the step B comprises the steps of B1 to B4;
Step B1. Based on a BERT model that takes Chinese word segments as input and outputs the embedding vector corresponding to each segment, attach an average pooling layer to the BERT model output to construct a Sentence-BERT model that takes the Chinese word segments corresponding to a Chinese text as input and outputs the high-dimensional sentence embedding vector corresponding to that text; then proceed to step B2;
Step B2. Using the Sentence-BERT model, obtain the high-dimensional sentence embedding vectors u and v corresponding to the two sentences of a sentence pair, and combine them with the twin-network classification objective

O = softmax(W(u, v, |u - v|))

whose output O is the classification result detecting whether the meanings of the two high-dimensional sentence embedding vectors are the same, constructing the combined network to be trained; then proceed to step B3. Here u and v denote the high-dimensional sentence embedding vectors of the two sentences in the pair, and W denotes the trainable weight parameters;
Step B3. Perform fine-tuning training of the combined network to be trained on sentence pairs with known meaning labels from a public text classification dataset, obtaining the trained Sentence-BERT model, i.e., the Sentence-BERT model fine-tuned from the pre-trained BERT model; then proceed to step B4;
Step B4. Using the Sentence-BERT model fine-tuned from the pre-trained BERT model, perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text, and then proceed to step C.
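To make the classification objective of step B2 concrete, the following is a minimal PyTorch sketch of the twin-network head; the embedding dimension of 384 (the size used later in the embodiment) and the class count of 3 are illustrative assumptions, and in actual fine-tuning a cross-entropy loss would be applied to the logits rather than training directly on the softmax output:

```python
import torch
import torch.nn as nn

class TwinClassifierHead(nn.Module):
    """Classification objective O = softmax(W(u, v, |u - v|)) for
    fine-tuning Sentence-BERT on labeled sentence pairs."""
    def __init__(self, embed_dim=384, num_classes=3):
        super().__init__()
        # W maps the concatenated features (u, v, |u - v|) to class logits
        self.W = nn.Linear(3 * embed_dim, num_classes)

    def forward(self, u, v):
        features = torch.cat([u, v, torch.abs(u - v)], dim=-1)
        return torch.softmax(self.W(features), dim=-1)

# u, v: pooled sentence embeddings of the two sentences in a pair
u, v = torch.randn(8, 384), torch.randn(8, 384)
probs = TwinClassifierHead()(u, v)   # shape (8, 3)
```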
As a preferred technical scheme of the invention: the step C comprises the following steps C1 to C3;
Step C1. Based on the high-dimensional sentence embedding vectors corresponding to the target Chinese texts, obtain the distance between every two high-dimensional sentence embedding vectors, and then proceed to step C2;
Step C2. For each high-dimensional sentence embedding vector, construct its weighted k-nearest-neighbor graph based on the preset k other high-dimensional sentence embedding vectors closest to it, thus obtaining the weighted k-nearest-neighbor graph corresponding to each high-dimensional sentence embedding vector, and then proceed to step C3;
Step C3. According to a preset minimum distance in the preset low-dimensional target space, for each high-dimensional sentence embedding vector, apply the UMAP projection dimension reduction algorithm with the objective of minimizing a cross-entropy cost function, projecting the weighted k-nearest-neighbor graph of that vector into the preset low-dimensional target space to obtain the corresponding low-dimensional sentence embedding vector; the low-dimensional sentence embedding vector corresponding to each target Chinese text is thus obtained.
As a preferred technical scheme of the invention: the step D comprises the following steps D1 to D6;
Step D1. Based on the low-dimensional sentence embedding vectors corresponding to the target Chinese texts, obtain the distance between every two low-dimensional sentence embedding vectors, and then proceed to step D2;
Step D2. For each low-dimensional sentence embedding vector x, sort the distances between x and the other low-dimensional sentence embedding vectors in ascending order and take the m-th smallest distance, core_m(x) = d(x, N_m(x)), as the core distance corresponding to x; the core distance corresponding to each low-dimensional sentence embedding vector is thus obtained, and then proceed to step D3. Here N_m(x) denotes the m-th nearest other low-dimensional sentence embedding vector to x, and d(x, N_m(x)) denotes the distance function between x and N_m(x);
Step D3. Based on the core distances corresponding to the low-dimensional sentence embedding vectors, combine them with the distance between every two low-dimensional sentence embedding vectors according to

d_mreach-m(a, b) = max{ core_m(a), core_m(b), d(a, b) }

obtaining the mutual reachability distance between every two low-dimensional sentence embedding vectors, and then proceed to step D4. Here core_m(a) and core_m(b) denote the core distances corresponding to the low-dimensional sentence embedding vectors a and b, d(a, b) denotes the distance between a and b, max{ } denotes the maximum function, and d_mreach-m(a, b) denotes the mutual reachability distance between a and b;
Step D4. Taking each low-dimensional sentence embedding vector as a vertex and the mutual reachability distance between every two low-dimensional sentence embedding vectors as the length of the edge connecting the corresponding vertices, construct a minimum spanning tree in which all vertices are connected by edges and the sum of the lengths of all edges is minimal, and then proceed to step D5;
Step D5. Based on all the edges in the minimum spanning tree, select the edges in ascending order of their lengths, merging the objects connected at the two ends of each selected edge into the same cluster, obtaining all the clusters, and then proceed to step D6;
Step D6. Remove from all the clusters those whose size is smaller than the preset minimum cluster size; the remaining clusters are the topic classes corresponding to the target corpus.
As a preferred technical scheme of the invention: the step E comprises the steps E1 to E3;
Step E1. Based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target corpus, obtain the Chinese word segments of the target Chinese texts included in each topic class; then, for each topic class, aggregate the Chinese word segments of all its target Chinese texts into a single document corresponding to that topic class, thus obtaining the document corresponding to each topic class of the target corpus, and proceed to step E2;
Step E2. For each topic class corresponding to the target corpus, and for each Chinese word segment in the document corresponding to that topic class, compute according to

W_{t,c} = (g_{c,t} / G_c) * log(1 + A / tf_t)

the c-TF-IDF score of each Chinese word segment in the document corresponding to each topic class, thus obtaining the c-TF-IDF score corresponding to each Chinese word segment in each topic class of the target corpus, and then proceed to step E3. Here g_{c,t} denotes the number of times the Chinese word segment t of topic class c appears in the document of topic class c, G_c denotes the number of Chinese word segments in the document of topic class c, tf_t denotes the number of times the Chinese word segment t appears in the documents of all topic classes, and A denotes the average number of Chinese word segments in the documents of the topic classes;
Step E3. Based on the c-TF-IDF scores corresponding to the Chinese word segments in each topic class of the target corpus, take the preset J Chinese word segments with the highest c-TF-IDF scores in each topic class as the topic words of that topic class, thereby realizing topic class analysis of the target corpus.
As a preferred technical scheme of the invention: the step F comprises the steps of F1 to F3;
Step F1. Perform word embedding representation of the topic words based on the DSG model, obtaining a high-dimensional vector of preset dimension for each topic word in each topic class corresponding to the target corpus, and then proceed to step F2;
Step F2. For each topic class corresponding to the target corpus, execute the following steps F2-1 to F2-4 to obtain the similarity between that topic class and the historical topic class set, forming the classification index of each topic class of the target corpus, and then proceed to step F3;
Step F2-1. According to the historical topic classes in the historical topic class set and their topic words, for each topic word in topic class i corresponding to the target corpus, compute according to

S_{i_k_j_l} = model.similarity(old_topics_word[k][l], new_topics_word[i][j])

the word similarity S_{i_k_j_l} between the j-th topic word of topic class i corresponding to the target corpus and the l-th topic word of the k-th historical topic class; then proceed to step F2-2. Here 1 ≤ j ≤ J, with J the number of topic words in a topic class of the target corpus; 1 ≤ k ≤ K, with K the number of historical topic classes in the historical topic class set; and 1 ≤ l ≤ L, with L the number of topic words in a historical topic class. old_topics_word[k][l] denotes the l-th topic word of the k-th historical topic class, new_topics_word[i][j] denotes the j-th topic word of topic class i corresponding to the target corpus, and model.similarity( ) denotes the similarity function, using the cosine measure;
Step F2-2. Based on 1 ≤ l ≤ L, compute according to

S_{i_k_j} = max_l S_{i_k_j_l}

the similarity S_{i_k_j} between the j-th topic word of topic class i corresponding to the target corpus and the k-th historical topic class; then proceed to step F2-3;
Step F2-3. Based on 1 ≤ j ≤ J, compute according to

S_{i_k} = (1/J) Σ_{j=1}^{J} S_{i_k_j}

the similarity S_{i_k} between topic class i corresponding to the target corpus and the k-th historical topic class; then proceed to step F2-4;
Step F2-4. Based on 1 ≤ k ≤ K, compute according to

S_i = max_k S_{i_k}

the similarity S_i between topic class i corresponding to the target corpus and the historical topic class set;
Step F3. Given a preset classification threshold λ in the range 0 to 1, based on the classification index of each topic class corresponding to the target corpus: if the classification index of a topic class is ≤ λ, the topic class is judged to be a new topic class compared with the historical topic class set, and the topic class and its topic words are added to the historical topic class set; if the classification index of the topic class is > λ, the topic class is judged not to be a new topic class compared with the historical topic class set.
As a preferred technical scheme of the invention: the distances between the vectors involved are calculated using any one of several standard vector distance measures, such as the cosine distance or the Euclidean distance.
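For reference, the two metrics named above can be computed as in the following minimal NumPy sketch:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity of two vectors
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))
```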
Compared with the prior art, the big data topic analysis method based on an embedding model has the following technical effects:
(1) Compared with the traditional LDA topic modeling method, the big data topic analysis method based on an embedding model applies word embedding representation technology, incorporating pre-trained BERT word embedding vectors into the topic model and improving topic analysis performance;
(2) In the big data topic analysis method based on an embedding model, when applying BERT embeddings, and in view of the problem that BERT by itself is not well suited to semantic similarity search or unsupervised tasks, Sentence-BERT is applied to fine-tune the pre-trained BERT: a Siamese (twin) network is used to fine-tune the pre-trained BERT into an SBERT model that generates semantically meaningful high-dimensional sentence embedding vectors, so that semantically similar sentences lie close to each other and can be found using cosine similarity, Manhattan distance, Euclidean distance, and the like;
(3) In the big data topic analysis method based on an embedding model, when reducing the dimension of the high-dimensional sentence embedding vectors, the UMAP dimension reduction method is used; compared with other dimension reduction techniques such as PCA (principal component analysis), UMAP preserves both the local and the global structure of the data during dimension reduction, which is important for representing the semantics of text data;
(4) In the big data topic analysis method based on an embedding model, since the UMAP dimension reduction preserves some of the original high-dimensional structure, using HDBSCAN to search for high-density clusters (i.e., popular appeal topic classes) is meaningful; compared with DBSCAN, the greatest advantage of HDBSCAN is that the neighborhood radius R and MinPoints need not be chosen manually; in most cases only the minimum cluster size needs to be set, and the algorithm automatically recommends the optimal clustering result;
(5) In the big data topic analysis method based on an embedding model, after each Chinese text in the target corpus has been assigned to a cluster, the next step uses a class-based TF-IDF, called c-TF-IDF, to obtain the topic class representation; the topic classes extracted by the overall design have higher topic coherence and topic diversity;
(6) In the big data topic analysis method based on an embedding model, topic class fluctuation detection based on word embeddings is independently designed; a Directional Skip-Gram (DSG) model rather than Skip-Gram (SG) is adopted for the word embedding, so that the similarity between different topic words and between different topic classes can be computed effectively; topic class fluctuation detection is performed according to the similarity scores, and newly appearing hotspot appeal topic classes can finally be detected effectively.
Drawings
FIG. 1 is a flow chart of the big data topic analysis method based on an embedding model designed by the invention;
FIG. 2 is a schematic diagram of government hotline appeal content data in a government service platform according to the invention;
FIG. 3 is a flow chart of the data preprocessing algorithm in step A of the invention;
FIG. 4 is a diagram showing the data after preprocessing in step A of the invention;
FIG. 5 is a diagram of the fine-tuning twin network in step B of the invention;
FIG. 6 is a flowchart of the HDBSCAN algorithm in step D of the invention;
FIG. 7 is a visual representation of the clustering after the data are reduced to 2 dimensions in an application of step D of the invention;
FIG. 8 is a graph of the c-TF-IDF scores of some clusters calculated in an application of step E of the invention;
FIG. 9 is a diagram of the overall topic classes extracted for a given day in an application of step E of the invention;
FIG. 10 is a flowchart of the topic volatility detection algorithm of step F of the invention;
FIG. 11 is a graph showing the results of topic volatility detection in an application of step F of the invention.
Detailed Description
The following describes the embodiments of the present invention in further detail with reference to the drawings.
In practical application, for example to a government service hotline, the big data topic analysis method based on an embedding model collects, day by day, the Chinese texts of the hotline appeal content of citizens and enterprises to form a target corpus such as that shown in FIG. 2, and specifically executes the following steps A to E according to the design shown in FIG. 1 to realize topic class analysis of the target corpus.
Step A. For each target Chinese text in the target Chinese corpus shown in FIG. 2, perform the data preprocessing operation according to FIG. 3. Specifically, first remove from the target corpus the target Chinese texts not expressed in text form, using regular expressions from the re module of Python; then, for each target Chinese text, perform fine-grained Chinese word segmentation with the jieba module of Python to obtain the Chinese word segments corresponding to that text, remove the Chinese stop words among them, and update the Chinese word segments corresponding to the text; the Chinese word segments corresponding to each target Chinese text are thus obtained. Part of the results after data preprocessing is shown in FIG. 4. Then proceed to step B.
For the removal of stop words, the public HIT (Harbin Institute of Technology) stopword list, the Baidu stopword list, and the Sichuan University Machine Intelligence Laboratory stopword list are mainly used for the judgment: if a Chinese word segment appears in the Chinese stopword list, it is regarded as a stop word and removed; otherwise it is not a stop word and is retained without processing.
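A minimal Python sketch of this preprocessing step is given below; the non-text filtering rule and the stopword file paths are illustrative assumptions:

```python
import re
import jieba

def load_stopwords(paths):
    """Merge the public stopword lists (HIT, Baidu, SCU MIL) into one set."""
    stopwords = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            stopwords.update(line.strip() for line in f)
    return stopwords

def preprocess(texts, stopwords):
    """Drop non-text records, segment each text with jieba, remove stop words."""
    results = []
    for text in texts:
        # illustrative rule: keep only records containing Chinese characters
        if not re.search(r"[\u4e00-\u9fa5]", text):
            continue
        tokens = [w for w in jieba.lcut(text)
                  if w.strip() and w not in stopwords]
        results.append(tokens)
    return results
```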
Step B. Using the Sentence-BERT model fine-tuned from the pre-trained BERT model, perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text, and then proceed to step C.
In practical application, step B is specifically designed to execute steps B1 to B4 according to FIG. 5.
Step B1. Based on a BERT model that takes Chinese word segments as input and outputs the embedding vector corresponding to each segment, attach an average pooling layer to the BERT model output to construct a Sentence-BERT model that takes the Chinese word segments corresponding to a Chinese text as input and outputs the high-dimensional sentence embedding vector corresponding to that text; then proceed to step B2.
The average pooling layer averages, in each dimension, the embedding vectors of all the Chinese word segments of the same preprocessed Chinese text, so that each Chinese text yields a high-dimensional sentence embedding vector of fixed size.
Step B2. Using the Sentence-BERT model, obtain the high-dimensional sentence embedding vectors u and v corresponding to the two sentences of a sentence pair, and combine them with the twin-network classification objective

O = softmax(W(u, v, |u - v|))

whose output O is the classification result detecting whether the meanings of the two high-dimensional sentence embedding vectors are the same, constructing the combined network to be trained; then proceed to step B3. Here u and v denote the high-dimensional sentence embedding vectors of the two sentences in the pair, and W denotes the trainable weight parameters.
Step B3. Perform fine-tuning training of the combined network to be trained on sentence pairs with known meaning labels from a public text classification dataset, obtaining the trained Sentence-BERT model, i.e., the Sentence-BERT model fine-tuned from the pre-trained BERT model, and then proceed to step B4.
Step B4. Using the Sentence-BERT model fine-tuned from the pre-trained BERT model, perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text; in this application, each Chinese text is represented as a 384-dimensional high-dimensional sentence embedding vector. Then proceed to step C.
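A minimal sketch of this step with the sentence-transformers library follows; the checkpoint name is an assumption (any multilingual SBERT checkpoint with 384-dimensional output, for example paraphrase-multilingual-MiniLM-L12-v2, matches the dimensionality described here), and the sample texts are illustrative:

```python
from sentence_transformers import SentenceTransformer

# Assumed checkpoint producing 384-dimensional sentence embeddings
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

docs = ["小区 物业费 收取 标准 不合理",   # segmented appeal texts, space-joined
        "道路 施工 噪音 扰民 投诉"]
embeddings = model.encode(docs)           # shape: (n_docs, 384)
```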
Step C. According to the preset neighborhood size information and the preset low-dimensional target space, apply the Uniform Manifold Approximation and Projection (UMAP) dimension reduction algorithm to the high-dimensional sentence embedding vectors of the target Chinese texts, obtaining the low-dimensional sentence embedding vector corresponding to each target Chinese text, and then proceed to step D.
Preserving the local and the global structure of the data during dimension reduction effectively represents the semantic information of the Chinese text data. The initialization parameter n_neighbors=25 sets the local neighborhood size of the UMAP projection dimension reduction algorithm to 25; n_neighbors is the parameter that controls the balance between local and global structure in the text data. The initialization parameter n_components=10 sets the target dimension after UMAP dimension reduction to 10; n_components is the dimension of the data passed to the clustering model. Meanwhile, the cosine measure is adopted for the distance between different sample points, i.e., metric='cosine'.
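Continuing from the embedding sketch above, the dimension reduction with these parameters can be sketched with the umap-learn library as follows; min_dist corresponds to the preset minimum distance in the low-dimensional target space, and its value here is an assumption:

```python
import umap

reducer = umap.UMAP(n_neighbors=25,   # local neighborhood size, as above
                    n_components=10,  # target dimension passed to clustering
                    min_dist=0.0,     # assumed preset minimum distance
                    metric="cosine")
low_dim = reducer.fit_transform(embeddings)  # shape: (n_docs, 10)
```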
The UMAP projection dimension reduction algorithm is used for the dimension reduction; compared with other dimension reduction techniques such as PCA (principal component analysis), it preserves the local and global structure of the data, which is important for representing the semantics of text data. In practical application, step C is specifically designed to execute the following steps C1 to C3: before mapping the high-dimensional sentence embedding vectors to low-dimensional ones, the UMAP algorithm must establish what the high-dimensional sentence embedding vectors look like in the high-dimensional space, i.e., execute steps C1 to C2 below, and then continue with step C3.
Step C1. Based on the high-dimensional sentence embedding vectors corresponding to the target Chinese texts, obtain the distance between every two high-dimensional sentence embedding vectors, and then proceed to step C2.
Step C2. For each high-dimensional sentence embedding vector, construct its weighted k-nearest-neighbor graph based on the preset k other high-dimensional sentence embedding vectors closest to it, thus obtaining the weighted k-nearest-neighbor graph corresponding to each high-dimensional sentence embedding vector, and then proceed to step C3.
Step C3. According to the preset minimum distance in the preset low-dimensional target space, for each high-dimensional sentence embedding vector, take as the objective the minimization of the cross-entropy cost function

C = Σ_{i≠j} [ v_{ij} log(v_{ij} / w_{ij}) + (1 - v_{ij}) log((1 - v_{ij}) / (1 - w_{ij})) ]

where v_{ij} denotes the edge weights of the weighted k-nearest-neighbor graph in the high-dimensional space and w_{ij} the corresponding edge weights in the low-dimensional space, and apply the UMAP projection dimension reduction algorithm to project the weighted k-nearest-neighbor graph of the high-dimensional sentence embedding vector into the preset low-dimensional target space, obtaining the corresponding low-dimensional sentence embedding vector; the low-dimensional sentence embedding vector corresponding to each target Chinese text is thus obtained.
Because the UMAP projection dimension reduction algorithm retains some of the original high-dimensional structure, the HDBSCAN clustering algorithm is meaningful for finding high-density clusters (i.e., popular appeal topic classes). Compared with DBSCAN, the greatest advantage of HDBSCAN is that the neighborhood radius R and MinPoints need not be chosen manually; in most cases only the minimum cluster size needs to be set, and the algorithm automatically recommends the optimal clustering result. HDBSCAN also defines a new distance measure, and is specifically set to execute the following step D.
Step D. According to the preset minimum cluster size, apply the hierarchical density-based HDBSCAN clustering algorithm, as shown in FIG. 6, to cluster the low-dimensional sentence embedding vectors corresponding to the target Chinese texts; the resulting clusters serve as the topic classes corresponding to the target corpus. Then proceed to step E.
In practical application, step D is specifically designed as the following steps D1 to D6.
Step D1. Based on the low-dimensional sentence embedding vectors corresponding to the target Chinese texts, obtain the distance between every two low-dimensional sentence embedding vectors, and then proceed to step D2.
Step D2. For each low-dimensional sentence embedding vector x, sort the distances between x and the other low-dimensional sentence embedding vectors in ascending order and take the m-th smallest distance, core_m(x) = d(x, N_m(x)), as the core distance corresponding to x; the core distance corresponding to each low-dimensional sentence embedding vector is thus obtained, and then proceed to step D3. Here N_m(x) denotes the m-th nearest other low-dimensional sentence embedding vector to x, and d(x, N_m(x)) denotes the distance function between x and N_m(x).
Step D3. Based on the core distances corresponding to the low-dimensional sentence embedding vectors, combine them with the distance between every two low-dimensional sentence embedding vectors according to

d_mreach-m(a, b) = max{ core_m(a), core_m(b), d(a, b) }

obtaining the mutual reachability distance between every two low-dimensional sentence embedding vectors, and then proceed to step D4. Here core_m(a) and core_m(b) denote the core distances corresponding to the low-dimensional sentence embedding vectors a and b, d(a, b) denotes the distance between a and b, max{ } denotes the maximum function, and d_mreach-m(a, b) denotes the mutual reachability distance between a and b.
The mutual reachability distance is used to express the distance between two dimension-reduced low-dimensional sentence embedding vectors; specifically, the Euclidean measure can be adopted for the distance between different sample points, i.e., metric='euclidean'. This leaves the distances between sample points in dense regions unaffected, while the distances between sample points in sparse regions and other sample points are amplified, which increases the robustness of the clustering algorithm to non-dense regions.
Step D4. Taking each low-dimensional sentence embedding vector as a vertex and the mutual reachability distance between every two low-dimensional sentence embedding vectors as the length of the edge connecting the corresponding vertices, construct a minimum spanning tree in which all vertices are connected by edges and the sum of the lengths of all edges is minimal, and then proceed to step D5.
Regarding the minimum spanning tree: in practical application, the low-dimensional sentence embedding vectors are the vertices and the weight of the edge between any two points is their mutual reachability distance. A threshold is set and gradually lowered from a high value; any edge whose weight exceeds the threshold is deleted (i.e., the mutual reachability distance is too large for the two points to belong to the same cluster), splitting the weighted graph. The threshold values are taken from the set constructed by the minimum spanning tree of the weighted graph (the connected subgraph with minimal total weight), so that deleting any edge from this set splits the weighted graph.
Step D5. Based on all the edges in the minimum spanning tree, select the edges in ascending order of their lengths, merging the objects connected at the two ends of each selected edge into the same cluster, obtaining all the clusters, and then proceed to step D6.
Step D6. Remove from all the clusters those whose size is smaller than the preset minimum cluster size; the remaining clusters are the topic classes corresponding to the target corpus.
In practical application, step D6 limits the minimum subtree so that the generated clusters are not too small. First, the minimum cluster size parameter min_cluster_size=30 is initialized; min_cluster_size adjusts the number of topics: the larger it is, the fewer topics are extracted, and vice versa. The cluster tree is then traversed from top to bottom, and at each node split it is checked whether each of the two child clusters contains at least 30 samples. If only one child contains fewer than 30 samples, that child is deleted directly and the other child is promoted in place of the parent node; if both children contain fewer than 30 samples, both are deleted and the current node no longer splits downward; if both children contain at least 30 samples, the split proceeds normally.
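Continuing from the UMAP sketch above, this clustering step can be sketched with the hdbscan library and the parameters described here (points labeled -1 are noise falling outside every cluster):

```python
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=30,  # minimum generated cluster size
                            metric="euclidean")
labels = clusterer.fit_predict(low_dim)  # label -1 marks noise outside all clusters
```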
In the application, steps C and D are executed: after the high-dimensional sentence embedding vectors are reduced to 2-dimensional low-dimensional sentence embedding vectors for visualization, clustering is executed, with the result shown in FIG. 7; then proceed to step E.
Step E. Based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target corpus, obtain the Chinese word segments of the target Chinese texts included in each topic class; then apply the c-TF-IDF algorithm to obtain the c-TF-IDF score corresponding to each Chinese word segment in each topic class, take the preset number of Chinese word segments with the highest c-TF-IDF scores in each topic class as the topic words of that topic class, realizing topic class analysis of the target corpus, and then proceed to step F.
Instead of the conventional TF-IDF algorithm, a class-based TF-IDF algorithm called c-TF-IDF is employed here to obtain the topic words. c-TF-IDF differs from TF-IDF in the level at which frequency is measured: in conventional TF-IDF, TF measures word frequency within each document, whereas in c-TF-IDF, TF measures word frequency within each cluster, each cluster containing many documents.
In practical applications, the above step E is specifically designed to execute the following steps E1 to E3.
Step E1. Based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target corpus, obtain the Chinese word segments of the target Chinese texts included in each topic class; then, for each topic class, aggregate the Chinese word segments of all its target Chinese texts into a single document corresponding to that topic class, thus obtaining the document corresponding to each topic class of the target corpus, and proceed to step E2.
Step E2. For each topic class corresponding to the target corpus, and for each Chinese word segment in the document corresponding to that topic class, compute according to

W_{t,c} = (g_{c,t} / G_c) * log(1 + A / tf_t)

the c-TF-IDF score of each Chinese word segment in the document corresponding to each topic class, thus obtaining the c-TF-IDF score corresponding to each Chinese word segment in each topic class of the target corpus, and then proceed to step E3. Here g_{c,t} denotes the number of times the Chinese word segment t of topic class c appears in the document of topic class c, G_c denotes the number of Chinese word segments in the document of topic class c, tf_t denotes the number of times the Chinese word segment t appears in the documents of all topic classes, and A denotes the average number of Chinese word segments in the documents of the topic classes.
Step E3. Based on the c-TF-IDF scores corresponding to the Chinese word segments in each topic class of the target corpus, take the preset J Chinese word segments with the highest c-TF-IDF scores in each topic class as the topic words of that topic class, thereby realizing topic class analysis of the target corpus.
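A minimal sketch of the c-TF-IDF computation of steps E1 to E3, assuming the word segments of each topic class have already been aggregated into one document per class; the tiny example input is illustrative:

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """class_docs: {topic class c: list of word segments in its aggregated
    document}. Returns {c: {word t: (g_ct / G_c) * log(1 + A / tf_t)}}."""
    tf = Counter()                      # tf_t: occurrences of t over all classes
    for words in class_docs.values():
        tf.update(words)
    A = sum(len(w) for w in class_docs.values()) / len(class_docs)
    scores = {}
    for c, words in class_docs.items():
        g, G_c = Counter(words), len(words)   # g_ct and G_c
        scores[c] = {t: (n / G_c) * math.log(1 + A / tf[t])
                     for t, n in g.items()}
    return scores

scores = c_tf_idf({0: ["物业", "物业", "收费"], 1: ["噪音", "施工"]})
J = 20  # keep the top-J word segments of each class as its topic words
top_words = {c: sorted(s, key=s.get, reverse=True)[:J] for c, s in scores.items()}
```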
Step F. Perform word embedding representation of the topic words based on the DSG (Directional Skip-Gram) model, obtaining the high-dimensional vectors corresponding to the topic words in each topic class of the target corpus; then, combining the historical topic classes in the historical topic class set and their topic words, detect through similarity calculation which topic classes of the target corpus are new compared with the historical topic class set.
In the application, the c-TF-IDF scores of some clusters are calculated and the top 20 Chinese word segments of each cluster by c-TF-IDF score are extracted, as shown in FIG. 8; the overall topic classes extracted for a given day are shown in FIG. 9. Then proceed to step F.
In practical application, the above step F is performed as shown in FIG. 10; the specific design is the following steps F1 to F3.
Step F1. Perform word embedding representation of the topic words based on the DSG model, obtaining a high-dimensional vector of preset dimension for each topic word in each topic class corresponding to the target corpus, and then proceed to step F2.
Step F2. For each topic class corresponding to the target corpus, execute the following steps F2-1 to F2-4 to obtain the similarity between that topic class and the historical topic class set, forming the classification index of each topic class of the target corpus, and then proceed to step F3.
Step F2-1. According to the historical topic classes in the historical topic class set and their topic words, for each topic word in topic class i corresponding to the target corpus, compute according to

S_{i_k_j_l} = model.similarity(old_topics_word[k][l], new_topics_word[i][j])

the word similarity S_{i_k_j_l} between the j-th topic word of topic class i corresponding to the target corpus and the l-th topic word of the k-th historical topic class; then proceed to step F2-2. Here 1 ≤ j ≤ J, with J the number of topic words in a topic class of the target corpus; 1 ≤ k ≤ K, with K the number of historical topic classes in the historical topic class set; and 1 ≤ l ≤ L, with L the number of topic words in a historical topic class. old_topics_word[k][l] denotes the l-th topic word of the k-th historical topic class, new_topics_word[i][j] denotes the j-th topic word of topic class i corresponding to the target corpus, and model.similarity( ) denotes the similarity function, using the cosine measure.
Step F2-2. Based on 1 ≤ l ≤ L, compute according to

S_{i_k_j} = max_l S_{i_k_j_l}

the similarity S_{i_k_j} between the j-th topic word of topic class i corresponding to the target corpus and the k-th historical topic class; then proceed to step F2-3.
Step F2-3. Based on 1 ≤ j ≤ J, compute according to

S_{i_k} = (1/J) Σ_{j=1}^{J} S_{i_k_j}

the similarity S_{i_k} between topic class i corresponding to the target corpus and the k-th historical topic class; then proceed to step F2-4.
Step F2-4. Based on 1 ≤ k ≤ K, compute according to

S_i = max_k S_{i_k}

the similarity S_i between topic class i corresponding to the target corpus and the historical topic class set.
Step F3. The preset classification threshold λ, in the range 0 to 1, is initialized as λ=0.8. Based on the classification index of each topic class corresponding to the target corpus: if the classification index of a topic class is ≤ λ, the topic class is judged to be a new topic class compared with the historical topic class set, and the topic class and its topic words are added to the historical topic class set; if the classification index of the topic class is > λ, the topic class is judged not to be a new topic class compared with the historical topic class set. The final result of the practical application is shown in FIG. 11.
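Steps F2-1 to F3 can be sketched as follows; wv stands for any word-vector model exposing a gensim-style similarity() method, and the mean aggregation over j in step F2-3 is a reconstruction from context:

```python
import numpy as np

def detect_new_topic(new_words, old_topics, wv, lam=0.8):
    """new_words: the J topic words of topic class i; old_topics: K lists of
    L historical topic words; returns (S_i, True if class i is new)."""
    S_i = 0.0
    for old_words in old_topics:                               # k = 1..K
        # F2-2: for each new word j, its best match within historical class k
        s_ikj = [max(wv.similarity(w_old, w_new) for w_old in old_words)
                 for w_new in new_words]                       # j = 1..J
        S_ik = float(np.mean(s_ikj))                           # F2-3
        S_i = max(S_i, S_ik)                                   # F2-4
    return S_i, S_i <= lam                                     # F3, with λ = 0.8
```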
The Directional Skip-Gram (DSG) model is used instead of Skip-Gram (SG) for word embedding in step F because the SG model, although widely used for various tasks, relies only on word co-occurrence within a local context when predicting words, ignoring finer information such as word order and position. The DSG model considers not only the co-occurrence patterns of words but also their relative positions, modeled through a special "direction" vector that indicates whether the word to be predicted lies to the left or to the right of the given word. Based on the SG model, DSG provides a new softmax-based direction score g(w_{t+i}, w_t): for each context word w_{t+i}, a new vector δ is introduced which, together with the vector of w_t, measures the relative direction of the context word w_{t+i}, i.e., how it associates with w_t in its left or right context. In the corresponding update rule, σ denotes the sigmoid function, γ denotes the learning rate, and D indicates the direction of w_{t+i} given w_t. Finally, the Directional Skip-Gram model scores a context word as

f(w_{t+i}, w_t) = p(w_{t+i} | w_t) + g(w_{t+i}, w_t)
the word embedded representation of each subject word is required to be obtained according to the DSG model, then the similarity between the subject words of different subjects and the similarity between the subject words and the subjects are calculated according to the high-dimensional word embedded representation vectors, and the similarity between the different subjects is calculated, so that the fluctuation detection of the subjects is realized, and the steps F1 to F3 are specifically designed and executed.
In practice, the big data topic analysis method based on an embedding model calculates the distances between the vectors involved using any one of several standard vector distance measures, such as the cosine distance or the Euclidean distance.
In practical application, the big data topic analysis method designed by the invention first uses a Sentence-BERT (SBERT) model fine-tuned from a pre-trained BERT model to produce sentence embedding representations of the preprocessed Chinese texts; then applies the UMAP projection dimension reduction algorithm to reduce the dimension of the sentence embedding vectors; then clusters the reduced vectors with the HDBSCAN clustering algorithm, assigning each Chinese text in the target corpus to a corresponding topic class; obtains the topic representation with the class-based c-TF-IDF algorithm, selecting the Chinese word segments with the highest c-TF-IDF scores to represent each topic class; and finally uses the Directional Skip-Gram (DSG) model, an improvement on the Skip-Gram (SG) model, to produce word embedding representations of the topic words, calculates the similarity between different topic words and between different topic classes, and performs fluctuation detection for newly appearing topic classes. Compared with the NMF and LDA topic models, the topics extracted by the overall design have higher topic coherence and topic diversity, and when detecting the volatility of new topic classes the method can promptly and accurately detect new hotspot appeal topic classes and give early warning.
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (9)

1. A big data topic analysis method based on an embedding model, characterized in that topic class analysis of a target corpus is realized according to the following steps A to E:
step A, performing a data preprocessing operation on each target Chinese text in the target Chinese text set to obtain the Chinese word segments corresponding to each target Chinese text, and then entering step B;
step B, using a Sentence-BERT model fine-tuned from a BERT pre-training model to perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text, and then entering step C;
step C, according to preset neighborhood size information and a preset low-dimensional target space, performing dimension reduction on the high-dimensional sentence embedding vector of each target Chinese text by using the uniform manifold approximation and projection (UMAP) dimension reduction algorithm, obtaining the low-dimensional sentence embedding vector corresponding to each target Chinese text, and then entering step D;
step D, applying the hierarchy- and density-based HDBSCAN clustering algorithm to cluster the low-dimensional sentence embedding vectors corresponding to the target Chinese texts according to a preset minimum cluster size, obtaining the corresponding class clusters as the topic classes corresponding to the target Chinese text set, and then entering step E;
step E, based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target Chinese text set, obtaining the Chinese word segments of the target Chinese texts included in each topic class; then applying the c-TF-IDF algorithm to obtain the c-TF-IDF score corresponding to each Chinese word segment in each topic class; and obtaining the preset m Chinese word segments with the largest c-TF-IDF scores in each topic class to form the topic words of each topic class corresponding to the target Chinese text set, thereby realizing topic class analysis of the target Chinese text set.
2. The big data topic analysis method based on the embedded model of claim 1, wherein: the method further comprises a step F, which is entered after step E is executed;
step F, performing word embedding representation of the topic words based on the DSG model, obtaining the high-dimensional vector corresponding to each topic word in each topic class corresponding to the target Chinese text set, and, in combination with each historical topic class in a historical topic class set and the topic words in those historical topic classes, realizing through similarity calculation the analysis of which topic classes corresponding to the target Chinese text set are new compared with the historical topic class set.
3. The big data topic analysis method based on the embedded model of claim 1, wherein: in step A, target Chinese texts expressed in non-textual form are first removed from the target Chinese text set; then, for each target Chinese text in the target Chinese text set, Chinese fine-grained word segmentation is executed to obtain the Chinese word segments corresponding to the target Chinese text, the Chinese stop words among them are removed, and the Chinese word segments corresponding to the target Chinese text are updated; the Chinese word segments corresponding to each target Chinese text are thereby obtained.
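As an illustration, a minimal sketch of this preprocessing step, assuming the jieba segmenter and a caller-supplied stop-word list (both are assumptions of the sketch; the claim does not name a specific tool):

```python
import jieba

def preprocess(texts, stopwords):
    """Step A sketch: fine-grained segmentation plus stop-word removal."""
    segmented = []
    for text in texts:
        # precise-mode segmentation; jieba.cut_for_search gives finer granularity
        words = [w for w in jieba.lcut(text) if w.strip() and w not in stopwords]
        segmented.append(words)
    return segmented
```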
4. The big data topic analysis method based on the embedded model of claim 1, wherein: the step B comprises the steps of B1 to B4;
step B1, based on a BERT model that takes Chinese word segments as input and outputs the embedding vectors corresponding to those Chinese word segments, connecting an average pooling layer to the output end of the BERT model, thereby constructing a Sentence-BERT model that takes the Chinese word segments corresponding to a Chinese text as input and outputs the high-dimensional sentence embedding vector corresponding to that Chinese text, and then entering step B2;
step B2, using the Sentence-BERT model to obtain the high-dimensional sentence embedding vectors corresponding respectively to the two sentences in a sentence pair, and combining it with a twin-network classification objective function that takes as classification result whether the meanings of the two high-dimensional sentence embedding vectors are the same:

O = softmax(W(u, v, |u−v|))

thereby constructing a combined network to be trained, and then entering step B3; wherein u and v respectively represent the high-dimensional sentence embedding vectors corresponding to the two sentences in the sentence pair, W represents the weight parameters applied to the concatenation of u, v, and |u−v|, and O represents the classification result detecting whether the meanings of u and v are the same;
step B3, performing fine-tuning training on the combined network to be trained based on sentence pairs with known meaning labels in a public text classification data set, obtaining the trained Sentence-BERT model, namely the Sentence-BERT model fine-tuned from the BERT pre-training model, and then entering step B4;
step B4, using the Sentence-BERT model fine-tuned from the BERT pre-training model to perform sentence embedding representation on the Chinese word segments corresponding to each target Chinese text, obtaining the high-dimensional sentence embedding vector corresponding to each target Chinese text, and then entering step C.
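As an illustration, steps B1 to B3 can be sketched with the sentence-transformers library, whose SoftmaxLoss implements exactly this softmax over (u, v, |u−v|); the base checkpoint name and the toy sentence pair are assumptions of the sketch:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# B1: BERT encoder plus an average pooling layer on its output
word_emb = models.Transformer("bert-base-chinese")  # checkpoint name illustrative
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
sbert = SentenceTransformer(modules=[word_emb, pooling])

# B2/B3: twin-network classification objective O = softmax(W(u, v, |u-v|)),
# fine-tuned on sentence pairs with known same/different meaning labels
train = [InputExample(texts=["今天天气很好", "今日天气不错"], label=1)]
loader = DataLoader(train, shuffle=True, batch_size=16)
loss = losses.SoftmaxLoss(
    model=sbert,
    sentence_embedding_dimension=sbert.get_sentence_embedding_dimension(),
    num_labels=2)
sbert.fit(train_objectives=[(loader, loss)], epochs=1)
```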
5. The big data topic analysis method based on the embedded model of claim 1, wherein: the step C comprises the following steps C1 to C3;
step C1, obtaining the distance between every two high-dimensional sentence embedding vectors based on the high-dimensional sentence embedding vectors corresponding to the target Chinese texts, and then entering step C2;
step C2, for each high-dimensional sentence embedding vector, constructing the weighted k-nearest-neighbor graph corresponding to that vector based on the preset k other high-dimensional sentence embedding vectors closest to it, thereby obtaining the weighted k-nearest-neighbor graph corresponding to each high-dimensional sentence embedding vector, and then entering step C3;
step C3, according to a preset minimum distance in the preset low-dimensional target space, for each high-dimensional sentence embedding vector, taking the minimization of a cross-entropy cost function as the objective, applying the UMAP projection dimension reduction algorithm to project the weighted k-nearest-neighbor graph corresponding to that vector into the preset low-dimensional target space, obtaining the corresponding low-dimensional sentence embedding vector, thereby obtaining the low-dimensional sentence embedding vector corresponding to each target Chinese text.
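As an illustration, these three preset quantities map directly onto parameters of the umap-learn package (an assumed implementation; the values and the name high_dim_vectors, standing for the step-B embeddings, are placeholders):

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,    # preset neighborhood size k (steps C1-C2)
    n_components=5,    # dimension of the preset low-dimensional target space
    min_dist=0.0,      # preset minimum distance in that space (step C3)
    metric="cosine",
)
low_dim_vectors = reducer.fit_transform(high_dim_vectors)  # shape: (n_texts, 5)
```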
6. The big data topic analysis method based on the embedded model of claim 1, wherein: the step D comprises the following steps D1 to D6;
step D1, obtaining the distance between every two low-dimensional sentence embedding vectors based on the low-dimensional sentence embedding vectors corresponding to the target Chinese texts, and then entering step D2;
step D2, for each low-dimensional sentence embedding vector x, based on the distances from x to the other low-dimensional sentence embedding vectors, obtaining its core distance

core_m(x) = d(x, N_m(x))

as the core distance corresponding to the low-dimensional sentence embedding vector x; obtaining the core distance corresponding to each low-dimensional sentence embedding vector, and then entering step D3; wherein N_m(x) represents the m-th nearest other low-dimensional sentence embedding vector to x, and d(x, N_m(x)) represents the distance function between x and N_m(x);
step D3, based on the core distances corresponding to the low-dimensional sentence embedding vectors, combining the distance between every two low-dimensional sentence embedding vectors according to the following formula:

d_mreach-m(a, b) = max{core_m(a), core_m(b), d(a, b)}

obtaining the mutual reachability distance between every two low-dimensional sentence embedding vectors, and then entering step D4; wherein core_m(a) and core_m(b) respectively represent the core distances corresponding to the low-dimensional sentence embedding vectors a and b, d(a, b) represents the distance between the low-dimensional sentence embedding vectors a and b, and d_mreach-m(a, b) represents the mutual reachability distance between the low-dimensional sentence embedding vectors a and b;
step D4, taking the low-dimensional sentence embedding vectors as vertices and the mutual reachability distance between every two low-dimensional sentence embedding vectors as the length of the edge connecting the corresponding vertices, constructing the minimum spanning tree in which all vertices are connected and the sum of the lengths of all edges is minimal, and then entering step D5;
step D5, based on all edges in the minimum spanning tree, selecting the edges in order of their corresponding distances from smallest to largest and merging the objects connected at the two ends of each edge into the same class cluster, obtaining the class clusters, and then entering step D6;
step D6, removing from the class clusters those whose size is smaller than the preset minimum cluster size, and taking the remaining class clusters as the topic classes corresponding to the target Chinese text set.
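As an illustration, a compact sketch of steps D1 to D4, using scipy for the pairwise distances and the minimum spanning tree; in practice the hdbscan package wraps the entire procedure (both library choices are assumptions of the sketch):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mutual_reachability(X, m=5):
    """Steps D1-D3: pairwise, core, and mutual reachability distances."""
    d = squareform(pdist(X))                              # D1: pairwise distances
    core = np.sort(d, axis=1)[:, m]                       # D2: core_m(x) = d(x, N_m(x))
    mreach = np.maximum(d, np.maximum.outer(core, core))  # D3: max{core_m(a), core_m(b), d(a,b)}
    np.fill_diagonal(mreach, 0.0)
    return mreach

X = np.random.rand(100, 5)                           # stand-in low-dimensional vectors
mst = minimum_spanning_tree(mutual_reachability(X))  # D4: minimum spanning tree
# D5/D6: cut MST edges in ascending order and drop clusters below the preset
# minimum cluster size -- wrapped end to end by, e.g.:
#   import hdbscan; hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)
```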
7. The big data topic analysis method based on the embedded model of claim 1, wherein: the step E comprises the steps E1 to E3;
step E1, based on the low-dimensional sentence embedding vectors included in each topic class corresponding to the target Chinese text set, obtaining the Chinese word segments of the target Chinese texts included in each topic class; then, for each topic class, aggregating the Chinese word segments of all target Chinese texts in that topic class into a single document corresponding to the topic class, thereby obtaining the document corresponding to each topic class corresponding to the target Chinese text set, and then entering step E2;
step E2, for each topic class corresponding to the target Chinese text set, and further for each Chinese word segment in the document corresponding to that topic class, according to the following formula:

W_{c,t} = (g_{c,t} / G_c) × log(1 + A / tf_t)

obtaining the c-TF-IDF score W_{c,t} corresponding to each Chinese word segment in the document corresponding to each topic class, thereby obtaining the c-TF-IDF score corresponding to each Chinese word segment in each topic class corresponding to the target Chinese text set, and then entering step E3; wherein g_{c,t} represents the number of times the Chinese word segment t of topic class c appears in the document of topic class c, G_c represents the number of Chinese word segments in the document of topic class c, tf_t represents the number of times the Chinese word segment t appears in the documents of all topic classes, and A represents the average number of Chinese word segments in the documents of the topic classes;
step E3, based on the c-TF-IDF scores corresponding to the Chinese word segments in each topic class corresponding to the target Chinese text set, obtaining the preset J Chinese word segments with the largest c-TF-IDF scores in each topic class to form the topic words of each topic class corresponding to the target Chinese text set, thereby realizing topic class analysis of the target Chinese text set.
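As an illustration, a minimal sketch of steps E2 and E3 over a class-by-vocabulary count matrix, following the formula as reconstructed above (the matrix layout and the J value are assumptions of the sketch):

```python
import numpy as np

def ctf_idf(counts):
    """c-TF-IDF scores from a (topic classes x vocabulary) count matrix.

    counts[c, t] = g_{c,t}: occurrences of word segment t in the document of class c.
    """
    G = counts.sum(axis=1, keepdims=True)   # G_c: word segments per class document
    tf_t = counts.sum(axis=0)               # tf_t: occurrences of t over all classes
    A = counts.sum() / counts.shape[0]      # A: average word segments per class
    return (counts / G) * np.log(1 + A / tf_t)

def top_topic_words(counts, vocab, J=10):
    """Step E3: the J highest-scoring word segments per topic class."""
    scores = ctf_idf(counts)
    return [[vocab[i] for i in np.argsort(row)[::-1][:J]] for row in scores]
```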
8. The big data topic analysis method based on the embedded model of claim 2, wherein: the step F comprises the steps of F1 to F3;
step F1, performing word embedding representation of the topic words based on the DSG model, obtaining the high-dimensional vector of the preset dimension corresponding to each topic word in each topic class corresponding to the target Chinese text set, and then entering step F2;
step F2, for each topic class corresponding to the target Chinese text set, executing the following steps F2-1 to F2-4 to obtain the similarity between each such topic class and the historical topic class set, forming the classification index of each topic class corresponding to the target Chinese text set, and then entering step F3;
step F2-1, according to each historical topic class in the historical topic class set and the topic words in those historical topic classes, for each topic word in topic class i corresponding to the target Chinese text set, according to the following formula:

S_{i_k_j_l} = model.similarity(old_topics_word[k][l], new_topics_word[i][j])

calculating the word similarity S_{i_k_j_l} between the j-th topic word in topic class i corresponding to the target Chinese text set and the l-th topic word in the k-th historical topic class, and then entering step F2-2; wherein 1 ≤ j ≤ J and J represents the number of topic words in a topic class corresponding to the target Chinese text set, 1 ≤ k ≤ K and K represents the number of historical topic classes in the historical topic class set, 1 ≤ l ≤ L and L represents the number of topic words in a historical topic class; old_topics_word[k][l] represents the l-th topic word in the k-th historical topic class, new_topics_word[i][j] represents the j-th topic word in topic class i corresponding to the target Chinese text set, and model.similarity(·) represents the similarity function, adopting the cosine measure;
step F2-2, based on 1 ≤ l ≤ L, according to the following formula:

S_{i_k_j} = max_l S_{i_k_j_l}

calculating the similarity S_{i_k_j} between the j-th topic word in topic class i corresponding to the target Chinese text set and the k-th historical topic class, and then entering step F2-3;
step F2-3, based on 1 ≤ j ≤ J, according to the following formula:

S_{i_k} = (1/J) Σ_{j=1}^{J} S_{i_k_j}

calculating the similarity S_{i_k} between topic class i corresponding to the target Chinese text set and the k-th historical topic class, and then entering step F2-4;
step F2-4, based on 1 ≤ k ≤ K, according to the following formula:

S_i = max_k S_{i_k}

calculating the similarity S_i between topic class i corresponding to the target Chinese text set and the historical topic class set.
step F3, according to a preset classification threshold λ in the value range from 0 to 1, based on the classification index of each topic class corresponding to the target Chinese text set: if the classification index of a topic class is ≤ λ, the topic class is judged to be a new topic class compared with the historical topic class set, and the topic class together with its topic words is added to the historical topic class set; if the classification index of the topic class is > λ, the topic class is judged not to be a new topic class compared with the historical topic class set.
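As an illustration, a minimal sketch of steps F2 and F3; the embedding lookup, the averaging in F2-3 (reconstructed above), and the threshold value are assumptions of the sketch:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_new_topic(new_words, old_topics, embed, lam=0.5):
    """Steps F2-F3: classification index S_i of one topic class, thresholded at lam.

    new_words  -- topic words of topic class i (list of strings)
    old_topics -- historical topic classes, each a list of topic words
    embed      -- mapping word -> embedding vector (e.g. DSG word vectors)
    lam        -- preset classification threshold in (0, 1); value illustrative
    """
    S_i = 0.0
    for old_words in old_topics:                                     # 1 <= k <= K
        S_ikj = [max(cosine(embed[w], embed[o]) for o in old_words)  # F2-1, F2-2
                 for w in new_words]
        S_ik = sum(S_ikj) / len(S_ikj)                               # F2-3: mean over j
        S_i = max(S_i, S_ik)                                         # F2-4: max over k
    return S_i <= lam                                                # F3: new topic class?
```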
9. The big data topic analysis method based on the embedded model according to any one of claims 1 to 8, wherein: the distance between vectors is calculated by adopting any one of the sine distance, cosine distance, and Euclidean distance.
CN202310337528.7A 2023-03-31 2023-03-31 Big data topic analysis method based on embedded model Pending CN117113982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310337528.7A CN117113982A (en) 2023-03-31 2023-03-31 Big data topic analysis method based on embedded model

Publications (1)

Publication Number Publication Date
CN117113982A true CN117113982A (en) 2023-11-24

Family

ID=88798988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310337528.7A Pending CN117113982A (en) 2023-03-31 2023-03-31 Big data topic analysis method based on embedded model

Country Status (1)

Country Link
CN (1) CN117113982A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633561A (en) * 2024-01-24 2024-03-01 上海蜜度科技股份有限公司 Text clustering method, system, electronic equipment and medium
CN117743838A (en) * 2024-02-20 2024-03-22 卓世智星(成都)科技有限公司 Data knowledge extraction method for large language model
CN117743838B (en) * 2024-02-20 2024-04-30 卓世智星(成都)科技有限公司 Data knowledge extraction method for large language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination