CN114661903A - Deep semi-supervised text clustering method, device and medium combining user intention - Google Patents
- Publication number: CN114661903A
- Application number: CN202210208434.5A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F18/2321 — Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F40/216 — Natural language analysis; parsing using statistical methods
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/30 — Semantic analysis
Abstract
The invention provides a deep semi-supervised text clustering method, device and medium that incorporate user intention. The method comprises the following steps. Step one: construct an intention information matrix. Step two: map the text to vectors and extract features from the text vectors with a neural network. Step three: optimize the encoder with the intention information matrix to obtain a better feature representation. Step four: use KL divergence as an auxiliary optimization to obtain an initial clustering result. Step five: construct an optimization function and use the intention information to guide the direction in which clusters are formed. Given pairwise-constraint supervision information, the method fully exploits a deep neural network to mine the intention information, fuses it into the feature representation, and simultaneously uses it to supervise the clustering process. This effectively addresses three problems of semi-supervised text clustering — differences in text representation, insufficient supervision strength, and neglect of the user's intention — thereby improving the accuracy of the clustering result and producing clusters better suited to downstream tasks.
Description
Technical Field
The invention belongs to the technical field of information extraction and text processing, and particularly relates to a deep semi-supervised text clustering method, device and medium combining user intention.
Background
With the advent of the information age, large-scale data appears before human beings in the form of text. Text clustering, which groups similar text documents into the same class, is one of the most important algorithms in the field of data mining. Traditional unsupervised text clustering partitions clusters according to the similarity between documents and requires no data attributes during classification. With the diversification of application scenarios and the differentiation of downstream tasks, different users have different intentions for partitioning the same batch of data and need to guide the clustering result according to those intentions. For example, for the same batch of news texts, user A intends to partition them by the 'region' to which the news belongs, while user B intends to partition them by the 'topic' of the news. Different intentions lead to different clustering results. However, traditional unsupervised clustering algorithms can only partition data according to its intrinsic structure and cannot take the intention information provided by the user into account. Therefore, in practical applications, a user provides different supervision information according to different downstream task requirements, and this supervision information is used to guide the clustering, yielding semi-supervised text clustering. Semi-supervised clustering is a learning method that combines semi-supervised learning with cluster analysis, and it has received wide attention and application in machine learning. A semi-supervised text clustering algorithm groups documents with the help of a small amount of supervision information; by using that information effectively, it improves the performance of the algorithm and reduces computational complexity.
At the theoretical level, research on semi-supervised text clustering can provide theoretical support for other natural language processing technologies and is a natural language processing topic worth pursuing.
Semi-supervised text clustering has been studied widely in machine learning from different perspectives, and a large number of algorithms have been proposed for various problems. Semi-supervised clustering methods fall into the following three types. Constraint-based semi-supervised clustering adds constraint information on top of traditional clustering so that the clustering effect reaches its best. Distance-based semi-supervised clustering transforms the similarity measure between samples during data preprocessing to obtain a new metric function under which samples linked by positive constraints become closer and samples linked by negative constraints become farther apart. Semi-supervised clustering combining constraints and distance merges the two previous approaches and can achieve a better clustering effect. However, these methods have the following shortcomings. First, the text representation difference problem: in practical applications, text expression varies, and different user clustering intentions require different representational emphases. Second, weak intention supervision: the supervision information can only guide the structural partition of a small number of text samples and cannot accurately express the user's overall clustering intention. Finally, user intention is ignored: for the same batch of data samples, different clustering results that satisfy the user's intention cannot be obtained according to a specific application scenario and the requirements of downstream tasks.
Disclosure of Invention
The invention provides a deep semi-supervised text clustering method, equipment and medium combined with user intention, aiming at solving the problems in the prior art.
The invention is realized by the following technical scheme, and provides a deep semi-supervised text clustering method combined with user intention, which specifically comprises the following steps:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
Further, in step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix of size n × n, where n is the size of the data set.
Further, in step two, the text is vectorized, and one of the following mappings is selected in the vectorization process: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec.
Further, in step three, the initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention.
Further, in step four, the distribution Q of the text vectors is obtained from step three. To give the distribution higher confidence, a distribution P is further computed from Q; the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thus yielding the clustering pseudo-label result.
Further, in step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process, and the optimal clustering result is finally obtained through iteration, achieving the aim of guiding the clustering process with the constraint information and yielding a text clustering result that incorporates the user's intention.
Further, the optimization function is of the form:
where y_i represents the cluster to which sample x_i is assigned and y_j represents the cluster to which sample x_j is assigned; Must-link means that the two sample points must belong to the same class, and Cannot-link means that the two sample points must not belong to the same class.
The invention provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the deep semi-supervised text clustering method combined with the user intention when executing the computer program.
The invention proposes a computer readable storage medium for storing computer instructions which, when executed by a processor, implement the steps of the method for deep semi-supervised text clustering in combination with user intent.
The invention has the beneficial effects that:
(1) The method can mine the user's intention and fuse it into semi-supervised text clustering, obtain a feature representation that better expresses the user's intention in a targeted way, produce a clustering result that better meets the user's requirements, and adapt to different downstream tasks. (2) Guiding the clustering process with intention alleviates the problem of weak supervision strength and provides a new line of thought for subsequent research on semi-supervised text clustering. (3) Given the important role text clustering plays in the field of natural language processing, semi-supervised text clustering that introduces user intention can obtain a better clustering result, suits different application scenarios, provides more useful support, and has greater theoretical significance and practical value.
Drawings
FIG. 1 is a technical roadmap for the present invention;
FIG. 2 is a diagram of a process model of the present invention;
FIG. 3 is a schematic diagram of the intent mining and fusion method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A deep semi-supervised text clustering method that incorporates user intention can obtain the corresponding clustering result; by taking the intentions of different users into account, it solves the technical problem addressed by the invention. In the invention, the pairwise constraint information provided by the user serves as the supervision information: the user's intention is mined from it, and a feature representation fusing that intention is obtained. During clustering, the intention information guides the clustering direction, and the supervision target of the intention information is achieved in two stages, yielding a clustering result that satisfies the user's intention and meets the requirements of different application scenarios and downstream tasks.
The invention provides a deep semi-supervised text clustering method that incorporates user intention. The technical approach is as follows: an encoding scheme for mining and fusing intention information is provided; from the perspective of fully exploiting the supervision information, neural network technology is introduced, the neural network's strength in extracting high-dimensional abstract features is exploited, and the intention information is mined and fused into the feature representation of the text. Through an iteratively optimized clustering process, the aim of guiding clustering with intention information is achieved, thereby improving the accuracy of the clustering result and effectively solving the existing problems.
With reference to fig. 1 to 3, the present invention provides a deep semi-supervised text clustering method with reference to user intentions, and the method specifically includes:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
In step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix of size n × n, where n is the size of the data set. Constructing the matrix digitally encodes the user's intention and facilitates the subsequent calculation of each part's loss function.
In step two, the text is vectorized, and one of the following mappings is selected: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec. The vectorized representation of text is often high-dimensional, so to avoid the curse of dimensionality during training, the method pre-trains a neural-network autoencoder for feature representation learning, obtaining the initial feature representation of the text.
In step three, the initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention. The feature representation output by this step is used in the subsequent clustering process.
In step four, the distribution Q of the text vectors is obtained from step three. To give the distribution higher confidence, a distribution P is further computed from Q; the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thus yielding the clustering pseudo-label result.
In step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process, and the optimal clustering result is finally obtained through iteration, achieving the aim of guiding the clustering process with the constraint information and yielding a text clustering result that incorporates the user's intention.
In this method, the encoder and the clustering process are optimized and guided by the intention information; combined with the user's intention, a clustering result meeting the requirements can be obtained, and experimental verification shows that the model achieves better performance.
Embodiment: as shown in FIGS. 1 to 3, a deep semi-supervised text clustering method combining user intention includes the following steps. Step one: construct an intention information matrix. Step two: map the text to vectors and extract features from the text vectors with a neural network. Step three: optimize the encoder with the intention information matrix to obtain a better feature representation. Step four: use KL divergence as an auxiliary optimization to obtain an initial clustering result. Step five: construct an optimization function and use the intention information to guide the direction in which clusters are formed.
In step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix R of size n × n, where n is the size of the data set. Let X be the original text data samples; the value of each entry r_ij of the matrix represents the constraint relationship between samples x_i and x_j and takes one of three values: r_ij = 1 means that samples x_i and x_j are in the same cluster, r_ij = -1 means that they are not in the same cluster, and r_ij = 0 means the data pair is tentatively unconstrained. Constructing the matrix digitally encodes the user's intention and facilitates the subsequent calculation of each part's loss function.
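The step-one matrix construction can be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name and the choice of a unit diagonal are illustrative assumptions, while the 1 / -1 / 0 encoding follows the embodiment's description of r_ij.

```python
import numpy as np

def build_intention_matrix(n, must_link, cannot_link):
    """Encode pairwise constraints as an n x n intention matrix R.

    R[i, j] =  1  -> samples i and j must share a cluster (must-link)
    R[i, j] = -1  -> samples i and j must not share a cluster (cannot-link)
    R[i, j] =  0  -> no constraint given for the pair
    """
    R = np.zeros((n, n), dtype=np.int8)
    for i, j in must_link:
        R[i, j] = R[j, i] = 1       # constraints are symmetric
    for i, j in cannot_link:
        R[i, j] = R[j, i] = -1
    np.fill_diagonal(R, 1)          # each sample trivially shares a cluster with itself
    return R
```

Note that the matrix is symmetric by construction, which matches the pairwise (unordered) nature of the constraints.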
In step two, the text is vectorized, and one of the following mappings is selected: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec. The vectorized representation of text is often high-dimensional, so to avoid the curse of dimensionality during training, the method pre-trains a neural-network autoencoder for feature representation learning, obtaining the initial feature representation of the text.
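One of the three vectorization options named above, TF-IDF, can be sketched as follows. The smoothed idf formula used here (ln((1+n)/(1+df)) + 1) is a common convention and an assumption; the patent does not specify which variant it uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF vectors.

    `docs` is a list of token lists. TF-only or Word2Vec mappings
    would substitute here per the patent's step two.
    """
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: k for k, w in enumerate(vocab)}
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [0.0] * len(vocab)
        for w, c in tf.items():
            # term frequency times smoothed inverse document frequency
            vec[idx[w]] = (c / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
        vectors.append(vec)
    return vectors, vocab
```

These raw vectors would then be compressed by the pre-trained autoencoder into the initial feature representation.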
In step three, the intention mining and feature-representation fusion part mainly solves the text representation difference problem; after the text data is feature-encoded, fusing the intention information is the key. The initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the intention information is fused by minimizing this loss.
In FIG. 2, X denotes the original data samples and Z denotes the feature representation obtained after the original feature distribution passes through the Encoder module. The invention constructs an IMA module to mine and fuse the user's intention information, performing fused intention encoding with the intention matrix R obtained in step one; the technical principle is shown in FIG. 3.
In FIG. 3, Z is the feature-vector representation obtained from the Encoder part, of size n × d, where n is the size of the data set and d = 10. The invention multiplies Z by its own transpose to obtain an n × n matrix W, i.e., the similarity matrix between samples in the data set. Two thresholds, _up and _down, are then set by a normalization algorithm to normalize the similarity matrix, obtaining a new matrix S whose values follow the rule below:
the model intention matrix R designed by the invention continuously optimizes the matrix S and recalls the Encoder part. The Loss function algorithm applied here is Similarity Loss, which can jointly measure the self-Similarity and relative Similarity of the sample pairs, so that it can optimize the correlation coding between the sample pairs through iteration. And (5) finely adjusting the encoder in the second step by minimizing the loss, and finally obtaining the text feature representation fused with the semantic information of the user intention. The feature representation distribution output by this step is used for the subsequent clustering process.
In step four, the distribution Q of the text vectors is obtained from step three, and to give the distribution higher confidence, a distribution P is further computed from Q. The method sets a clustering loss function: the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thereby yielding the clustering pseudo-label result.
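The patent names Q, P, and the KL loss but not their formulas; the sketch below assumes the standard DEC-style construction (Student's t soft assignment for Q and a squared-and-renormalized target P), which is the usual choice for this step and may differ from the patent's exact definitions.

```python
import numpy as np

def student_t_assignment(Z, centroids, alpha=1.0):
    """Soft cluster assignment Q via a Student's t kernel over feature-centroid distances."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Sharpened target P computed from Q to raise assignment confidence."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(P, Q, eps=1e-12):
    """KL(P || Q): the auxiliary loss minimized to refine the encoder and centroids."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

The pseudo-label for each sample would then be the argmax of its row of Q.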
In step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between this matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process.
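The label information matrix of step five can be sketched as follows; the function name is illustrative, and the 1/0 same-cluster encoding is an assumption chosen to mirror the intention matrix's structure.

```python
import numpy as np

def label_matrix(labels):
    """Build the n x n label information matrix from cluster pseudo-labels:
    entry (i, j) is 1 when samples i and j received the same pseudo-label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(np.int8)
```

Comparing this matrix with the intention matrix entry-by-entry exposes which constrained pairs the current clustering violates.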
The purpose of intention-guided clustering is to solve the weak-supervision problem: to find clusters that satisfy the given constraints to the greatest possible extent while taking the user's intention into account. The guidance strength is enhanced by learning the similarity relationships between constraint information pairs, and after continuous iteration the clustering result is optimized along a specific direction. To this end, the invention sets an optimization function of the following form:
where y_i represents the cluster to which sample x_i is assigned and y_j represents the cluster to which sample x_j is assigned; Must-link means that the two sample points must belong to the same class, and Cannot-link means that the two sample points must not belong to the same class. For the two constraint relations, the following pairing cost formulas are set:
Same-pair (must-link) constraint pairing cost:
L(X_p, X_q)^+ = D_KL(P* ∥ Q) + D_KL(Q* ∥ P)
Different-pair (cannot-link) constraint pairing cost:
L(X_p, X_q)^- = L_n(D_KL(P* ∥ Q), σ) + L_n(D_KL(Q* ∥ P), σ)
L_n(e, σ) = max(0, σ - e)
where σ is a parameter set to prevent overfitting. Minimizing the loss L_IDC by iteration finally yields the optimal clustering result, achieving the aim of guiding the clustering process with constraint information and obtaining a text clustering result that incorporates the user's intention.
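The two pairing costs above can be sketched directly. The patent's extract does not define P*, Q* versus P, Q; the sketch assumes they are the assignment distributions of the two samples in the pair, which is a labeled assumption rather than the patent's definition.

```python
import math

def kl_div(p, q, eps=1e-12):
    """Discrete KL divergence between two assignment distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def hinge(e, sigma):
    """L_n(e, sigma) = max(0, sigma - e); sigma guards against overfitting."""
    return max(0.0, sigma - e)

def must_link_cost(p_star, q, q_star, p):
    """Same-pair cost L(X_p, X_q)^+: pull the pair's distributions together."""
    return kl_div(p_star, q) + kl_div(q_star, p)

def cannot_link_cost(p_star, q, q_star, p, sigma=1.0):
    """Different-pair cost L(X_p, X_q)^-: push the pair's divergence above the margin sigma."""
    return hinge(kl_div(p_star, q), sigma) + hinge(kl_div(q_star, p), sigma)
```

A must-link pair with identical distributions costs nothing, while the same pair under a cannot-link constraint pays the full margin — exactly the opposing pressures the optimization function encodes.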
In conclusion, the deep semi-supervised text clustering method combined with the user intention has excellent performance.
The invention provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the deep semi-supervised text clustering method combined with the user intention when executing the computer program.
The invention proposes a computer readable storage medium for storing computer instructions which, when executed by a processor, implement the steps of the method for deep semi-supervised text clustering in combination with user intent.
The method, device and medium for deep semi-supervised text clustering combining user intention have been introduced in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (9)
1. A deep semi-supervised text clustering method combining user intention, characterized by comprising the following steps:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
2. The method according to claim 1, wherein in the first step, association relationship between data points is mined according to paired constraint information given by a user, so as to construct an intention information matrix with size n x n, wherein n is data set size.
3. The method according to claim 2, characterized in that in step two, the text is first represented as vectors, and in the vectorization process one of the following mappings is selected: term frequency (TF), term frequency-inverse document frequency (TF-IDF), or Word2Vec.
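As one of the mapping options named in the claim, a minimal TF-IDF vectorization of tokenized documents can be sketched as follows; the smoothed IDF variant (log(N/df) + 1) is an assumption, since the claim names the scheme but not the exact formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF vectors.

    TF: term count normalized by document length.
    IDF: log(N / df) + 1 smoothing, a common variant.
    """
    vocab = sorted({t for d in docs for t in d})
    index = {t: k for k, t in enumerate(vocab)}
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = [0.0] * len(vocab)
        for t, c in tf.items():
            v[index[t]] = (c / len(d)) * (math.log(N / df[t]) + 1.0)
        vecs.append(v)
    return vocab, vecs

docs = [["deep", "text", "clustering"],
        ["text", "clustering", "intention"],
        ["deep", "neural", "network"]]
vocab, vecs = tfidf_vectors(docs)
```

Terms occurring in fewer documents (e.g. "intention") receive a higher weight than terms occurring in many (e.g. "text"), which is the intended discriminative effect of the inverse document frequency factor.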
4. The method according to claim 3, characterized in that in step three, the initial feature representation obtained in step two is multiplied by its transpose to obtain a similarity matrix covering all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention.
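A minimal sketch of the similarity-loss computation in this claim, assuming L2-normalized features (so the matrix product gives cosine similarities) and a mean-squared loss restricted to constrained pairs — the claim names the loss but not its exact form:

```python
import numpy as np

def similarity_loss(Z, A):
    """Similarity loss between feature similarities and the intention matrix.

    Z: n x d initial feature matrix; rows are L2-normalized so that
    Z @ Z.T yields cosine similarities in [-1, 1].
    The loss is the mean squared difference on entries where the user
    expressed an intention (A != 0); the exact loss form is an
    assumption, not fixed by the claim.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                      # n x n similarity matrix
    mask = A != 0
    return S, np.mean((S[mask] - A[mask]) ** 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))            # toy initial features
A = np.eye(4)                          # intention matrix (1/-1/0 encoding)
A[0, 1] = A[1, 0] = 1.0
A[0, 2] = A[2, 0] = -1.0
S, loss = similarity_loss(Z, A)
```

In the method itself, this scalar loss would be back-propagated through the autoencoder to fine-tune its parameters, as the claim describes.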
5. The method according to claim 4, characterized in that in step four, step three yields a distribution Q over the text vectors; a distribution P with higher confidence is then computed from Q, the difference between the two distributions is measured with the KL divergence formula, and minimizing this loss helps the model learn high-confidence assignments, thereby refining the model parameters and cluster centroids and yielding the pseudo-label result of the clusters.
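The Q/P/KL machinery in this claim resembles the well-known deep embedded clustering recipe; a minimal sketch under that assumption (Student's t soft assignments for Q, a sharpened target for P — neither formula is spelled out in the claim):

```python
import numpy as np

def soft_assignments(Z, centroids, alpha=1.0):
    """Soft assignment Q of each feature vector to each cluster centroid,
    using a Student's t kernel (an assumption; the claim only names Q)."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Higher-confidence target P derived from Q by squaring and
    renormalizing, which sharpens confident assignments."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(P, Q):
    """KL(P || Q): the clustering loss minimized during refinement."""
    return float(np.sum(P * np.log(P / Q)))

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))          # toy text features
C = rng.normal(size=(2, 4))          # 2 cluster centroids
Q = soft_assignments(Z, C)
P = target_distribution(Q)
pseudo_labels = Q.argmax(axis=1)     # pseudo-label result of the clusters
```

Minimizing KL(P || Q) pulls each point toward the centroid it is already most confident about, which is how the loss "helps the model learn high-confidence distributions" in the claim's wording.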
6. The method according to claim 5, characterized in that in step five, an n x n label information matrix is constructed from the pseudo labels obtained in step four; a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix, and minimizing this loss optimizes and guides the clustering process; through iteration, the optimal clustering result is finally obtained, so that the constraint information guides the clustering process and a text clustering result combining the user's intention is produced.
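A minimal sketch of the label-matrix construction and its comparison against the intention matrix. The 1/-1 label encoding and the disagreement-count loss are illustrative assumptions; the claim's actual optimization function (claim 7) is not reproduced here:

```python
import numpy as np

def label_matrix(labels):
    """n x n matrix: 1.0 if two texts share a pseudo label, else -1.0
    (an assumed encoding, mirroring the intention matrix)."""
    return np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

def intention_loss(L, A):
    """Count of constrained pairs whose pseudo labels contradict the
    user's intention (A: 1 must-link, -1 cannot-link, 0 none).
    Illustrative only; not the patent's optimization function."""
    mask = A != 0
    return float(np.sum(L[mask] != A[mask]))

labels = np.array([0, 0, 1, 1])        # pseudo labels from step four
A = np.eye(4)                          # intention matrix
A[0, 1] = A[1, 0] = 1.0                # texts 0 and 1: must-link
A[0, 2] = A[2, 0] = -1.0               # texts 0 and 2: cannot-link
L = label_matrix(labels)
loss = intention_loss(L, A)
```

Here both constraints are satisfied by the pseudo labels, so the disagreement loss is zero; in the iterative scheme of the claim, a nonzero loss would feed back into step three to pull the clustering toward the user's intention.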
7. The method of claim 6, wherein the optimization function is of the form:
8. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1-7 when executing the computer program.
9. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210208434.5A CN114661903B (en) | 2022-03-03 | 2022-03-03 | Deep semi-supervised text clustering method, device and medium combining user intention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661903A true CN114661903A (en) | 2022-06-24 |
CN114661903B CN114661903B (en) | 2024-07-09 |
Family
ID=82027540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210208434.5A Active CN114661903B (en) | 2022-03-03 | 2022-03-03 | Deep semi-supervised text clustering method, device and medium combining user intention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661903B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049697A (en) * | 2023-01-10 | 2023-05-02 | 苏州科技大学 | Interactive clustering quality improving method based on user intention learning |
CN117875318A (en) * | 2023-02-27 | 2024-04-12 | 同心县启胜新能源科技有限公司 | Temperature and humidity control method and system for livestock breeding based on Internet of things and cloud platform |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170300564A1 (en) * | 2016-04-19 | 2017-10-19 | Sprinklr, Inc. | Clustering for social media data |
CN110309302A (en) * | 2019-05-17 | 2019-10-08 | 江苏大学 | A kind of uneven file classification method and system of combination SVM and semi-supervised clustering |
CN110516068A (en) * | 2019-08-23 | 2019-11-29 | 贵州大学 | A kind of various dimensions Text Clustering Method based on metric learning |
US20200074280A1 (en) * | 2018-08-28 | 2020-03-05 | Apple Inc. | Semi-supervised learning using clustering as an additional constraint |
CN111046907A (en) * | 2019-11-02 | 2020-04-21 | 国网天津市电力公司 | Semi-supervised convolutional network embedding method based on multi-head attention mechanism |
Non-Patent Citations (1)
Title |
---|
ZHONG Jiang; LIU Longhai; LIANG Chuanwei: "Active semi-supervised text clustering based on pairwise constraints", Computer Engineering (计算机工程), no. 13, 5 July 2011 (2011-07-05) *
Also Published As
Publication number | Publication date |
---|---|
CN114661903B (en) | 2024-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Long et al. | Sentiment analysis of text based on bidirectional LSTM with multi-head attention | |
CN111444340B (en) | Text classification method, device, equipment and storage medium | |
CN113204952B (en) | Multi-intention and semantic slot joint identification method based on cluster pre-analysis | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
CN114661903B (en) | Deep semi-supervised text clustering method, device and medium combining user intention | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN109344399B (en) | Text similarity calculation method based on stacked bidirectional lstm neural network | |
CN110619051B (en) | Question sentence classification method, device, electronic equipment and storage medium | |
CN107085581A (en) | Short text classification method and device | |
CN109933792B (en) | Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model | |
CN113392191B (en) | Text matching method and device based on multi-dimensional semantic joint learning | |
CN113672718A (en) | Dialog intention recognition method and system based on feature matching and field self-adaption | |
CN114358109A (en) | Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN116775497A (en) | Database test case generation demand description coding method | |
CN113032556A (en) | Method for forming user portrait based on natural language processing | |
CN112446405A (en) | User intention guiding method for home appliance customer service and intelligent home appliance | |
CN116167833B (en) | Internet financial risk control system and method based on federal learning | |
CN117216012A (en) | Theme modeling method, apparatus, electronic device, and computer-readable storage medium | |
Jing et al. | Chinese text sentiment analysis based on transformer model | |
CN114398903B (en) | Intention recognition method, device, electronic equipment and storage medium | |
CN115526174A (en) | Deep learning model fusion method for finance and economics text emotional tendency classification | |
CN115906845A (en) | E-commerce commodity title naming entity identification method | |
CN113204971B (en) | Scene self-adaptive Attention multi-intention recognition method based on deep learning | |
CN110516068B (en) | Multi-dimensional text clustering method based on metric learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||