CN114661903A - Deep semi-supervised text clustering method, device and medium combining user intention - Google Patents
- Publication number: CN114661903A
- Application number: CN202210208434.5A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/35 — Information retrieval of unstructured textual data; clustering; classification
- G06F18/2321 — Non-hierarchical clustering techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F40/216 — Natural language analysis; parsing using statistical methods
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates
- G06F40/30 — Semantic analysis
Abstract
The invention provides a deep semi-supervised text clustering method, device and medium that incorporate user intention. The method comprises the following steps. Step one: construct an intention information matrix. Step two: map the text to vectors and extract features from the text vectors with a neural network. Step three: optimize the encoder with the intention information matrix to obtain a better feature representation. Step four: use KL divergence as an auxiliary optimization to obtain an initial clustering result. Step five: construct an optimization function and use the intention information to guide the direction in which clusters are formed. Given pairwise-constraint supervision information, the method fully exploits a deep neural network to mine the intention information, fuses it into the feature representation, and simultaneously uses it to supervise the clustering process. This effectively addresses three problems of semi-supervised text clustering — differences in text representation, insufficient supervision strength, and neglect of the user's intention — thereby improving the accuracy of the clustering result and producing clusters better suited to downstream tasks.
Description
Technical Field
The invention belongs to the technical field of information extraction and text processing, and particularly relates to a deep semi-supervised text clustering method, device and medium combining user intention.
Background
With the advent of the information age, large-scale data appears before human beings in the form of text. Text clustering, which groups similar text documents into the same class, is one of the most important algorithms in the field of data mining. Traditional unsupervised text clustering partitions clusters according to the similarity between documents and requires no data attributes during classification. With the diversification of application scenarios and the differentiation of downstream tasks, different users have different intentions for partitioning the same batch of data and need to guide the clustering result according to those intentions. For example, for the same batch of news texts, user A intends to partition them by the 'region' to which the news belongs, while user B intends to partition them by the 'topic' of the news. Different intentions lead to different clustering results. However, traditional unsupervised clustering algorithms can only partition data according to its intrinsic structure and cannot take the intention information provided by the user into account. Therefore, in practical applications, a user provides different supervision information according to different downstream task requirements, and this supervision information is used to guide the clustering, yielding semi-supervised text clustering. Semi-supervised clustering is a learning method that combines semi-supervised learning with cluster analysis, and it has received wide attention and application in machine learning. A semi-supervised text clustering algorithm groups documents with the help of a small amount of supervision information; by using that information effectively, it improves the performance of the algorithm and reduces computational complexity.
At the theoretical level, research on semi-supervised text clustering can provide theoretical support for other natural language processing technologies and is a natural language processing topic worth pursuing.
Semi-supervised text clustering has been studied widely in machine learning from different perspectives, and a large number of algorithms have been proposed for various problems. Semi-supervised clustering methods fall into the following three types. Constraint-based semi-supervised clustering adds constraint information on top of traditional clustering so that the clustering effect reaches its best. Distance-based semi-supervised clustering transforms the similarity measure between samples during data preprocessing to obtain a new metric function under which samples linked by positive constraints become closer and samples linked by negative constraints become farther apart. Semi-supervised clustering combining constraints and distance merges the two previous approaches and can achieve a better clustering effect. However, these methods have the following shortcomings. First, the text representation difference problem: in practical applications, text expression varies, and different user clustering intentions require different representational emphases. Second, weak intention supervision: the supervision information can only guide the structural partition of a small number of text samples and cannot accurately express the user's overall clustering intention. Finally, user intention is ignored: for the same batch of data samples, different clustering results that satisfy the user's intention cannot be obtained according to a specific application scenario and the requirements of downstream tasks.
Disclosure of Invention
The invention provides a deep semi-supervised text clustering method, equipment and medium combined with user intention, aiming at solving the problems in the prior art.
The invention is realized by the following technical scheme, and provides a deep semi-supervised text clustering method combined with user intention, which specifically comprises the following steps:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
Further, in step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix of size n × n, where n is the size of the data set.
Further, in step two, the text is vectorized, and one of the following mappings is selected in the vectorization process: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec.
Further, in step three, the initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention.
Further, in step four, the distribution Q of the text vectors is obtained from step three. To give the distribution higher confidence, a distribution P is further computed from Q; the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thus yielding the clustering pseudo-label result.
Further, in step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process, and the optimal clustering result is finally obtained through iteration, achieving the aim of guiding the clustering process with the constraint information and yielding a text clustering result that incorporates the user's intention.
Further, the optimization function is of the form:
where y_i represents the cluster to which sample x_i is assigned and y_j represents the cluster to which sample x_j is assigned; Must-link means that the two sample points must belong to the same class, and Cannot-link means that the two sample points must not belong to the same class.
The invention provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the deep semi-supervised text clustering method combined with the user intention when executing the computer program.
The invention proposes a computer readable storage medium for storing computer instructions which, when executed by a processor, implement the steps of the method for deep semi-supervised text clustering in combination with user intent.
The invention has the beneficial effects that:
(1) The method can mine the user's intention and fuse it into semi-supervised text clustering, obtain a feature representation that better expresses the user's intention in a targeted way, produce a clustering result that better meets the user's requirements, and adapt to different downstream tasks. (2) Guiding the clustering process with intention alleviates the problem of weak supervision strength and provides a new line of thought for subsequent research on semi-supervised text clustering. (3) Given the important role text clustering plays in the field of natural language processing, semi-supervised text clustering that introduces user intention can obtain a better clustering result, suits different application scenarios, provides more useful support, and has greater theoretical significance and practical value.
Drawings
FIG. 1 is a technical roadmap for the present invention;
FIG. 2 is a diagram of a process model of the present invention;
FIG. 3 is a schematic diagram of the intent mining and fusion method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A deep semi-supervised text clustering method that incorporates user intention can obtain the corresponding clustering result; by taking the intentions of different users into account, it solves the technical problem addressed by the invention. In the invention, the pairwise constraint information provided by the user serves as the supervision information: the user's intention is mined from it, and a feature representation fusing that intention is obtained. During clustering, the intention information guides the clustering direction, and the supervision target of the intention information is achieved in two stages, yielding a clustering result that satisfies the user's intention and meets the requirements of different application scenarios and downstream tasks.
The invention provides a deep semi-supervised text clustering method that incorporates user intention. The technical approach is as follows: an encoding scheme for mining and fusing intention information is provided; from the perspective of fully exploiting the supervision information, neural network technology is introduced, the neural network's strength in extracting high-dimensional abstract features is exploited, and the intention information is mined and fused into the feature representation of the text. Through an iteratively optimized clustering process, the aim of guiding clustering with intention information is achieved, thereby improving the accuracy of the clustering result and effectively solving the existing problems.
With reference to fig. 1 to 3, the present invention provides a deep semi-supervised text clustering method with reference to user intentions, and the method specifically includes:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
In step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix of size n × n, where n is the size of the data set. Constructing the matrix digitally encodes the user's intention and facilitates the subsequent calculation of each part's loss function.
In step two, the text is vectorized, and one of the following mappings is selected: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec. The vectorized representation of text is often high-dimensional, so to avoid the curse of dimensionality during training, the method pre-trains a neural-network autoencoder for feature representation learning, obtaining the initial feature representation of the text.
In step three, the initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention. The feature representation output by this step is used in the subsequent clustering process.
In step four, the distribution Q of the text vectors is obtained from step three. To give the distribution higher confidence, a distribution P is further computed from Q; the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thus yielding the clustering pseudo-label result.
In step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process, and the optimal clustering result is finally obtained through iteration, achieving the aim of guiding the clustering process with the constraint information and yielding a text clustering result that incorporates the user's intention.
In this method, the encoder and the clustering process are optimized and guided by the intention information; combined with the user's intention, a clustering result meeting the requirements can be obtained, and experimental verification shows that the model achieves better performance.
Embodiment: as shown in FIGS. 1 to 3, a deep semi-supervised text clustering method combining user intention includes the following steps. Step one: construct an intention information matrix. Step two: map the text to vectors and extract features from the text vectors with a neural network. Step three: optimize the encoder with the intention information matrix to obtain a better feature representation. Step four: use KL divergence as an auxiliary optimization to obtain an initial clustering result. Step five: construct an optimization function and use the intention information to guide the direction in which clusters are formed.
In step one, the association relationships between data points are mined from the pairwise constraint information given by the user to construct an intention information matrix R of size n × n, where n is the size of the data set. Let X be the original text data samples; the value of each entry r_ij of the matrix represents the constraint relationship between samples x_i and x_j and takes one of three values: r_ij = 1 means that samples x_i and x_j are in the same cluster, r_ij = -1 means that they are not in the same cluster, and r_ij = 0 means the data pair is tentatively unconstrained. Constructing the matrix digitally encodes the user's intention and facilitates the subsequent calculation of each part's loss function.
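The step-one matrix construction can be sketched as follows. This is a minimal sketch, not the patent's implementation: the function name and the choice of a unit diagonal are illustrative assumptions, while the 1 / -1 / 0 encoding follows the embodiment's description of r_ij.

```python
import numpy as np

def build_intention_matrix(n, must_link, cannot_link):
    """Encode pairwise constraints as an n x n intention matrix R.

    R[i, j] =  1  -> samples i and j must share a cluster (must-link)
    R[i, j] = -1  -> samples i and j must not share a cluster (cannot-link)
    R[i, j] =  0  -> no constraint given for the pair
    """
    R = np.zeros((n, n), dtype=np.int8)
    for i, j in must_link:
        R[i, j] = R[j, i] = 1       # constraints are symmetric
    for i, j in cannot_link:
        R[i, j] = R[j, i] = -1
    np.fill_diagonal(R, 1)          # each sample trivially shares a cluster with itself
    return R
```

Note that the matrix is symmetric by construction, which matches the pairwise (unordered) nature of the constraints.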
In step two, the text is vectorized, and one of the following mappings is selected: term frequency (TF), term frequency–inverse document frequency (TF-IDF), or Word2Vec. The vectorized representation of text is often high-dimensional, so to avoid the curse of dimensionality during training, the method pre-trains a neural-network autoencoder for feature representation learning, obtaining the initial feature representation of the text.
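One of the three vectorization options named above, TF-IDF, can be sketched as follows. The smoothed idf formula used here (ln((1+n)/(1+df)) + 1) is a common convention and an assumption; the patent does not specify which variant it uses.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF vectors.

    `docs` is a list of token lists. TF-only or Word2Vec mappings
    would substitute here per the patent's step two.
    """
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: k for k, w in enumerate(vocab)}
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))  # document frequency per term
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [0.0] * len(vocab)
        for w, c in tf.items():
            # term frequency times smoothed inverse document frequency
            vec[idx[w]] = (c / len(doc)) * (math.log((1 + n) / (1 + df[w])) + 1.0)
        vectors.append(vec)
    return vectors, vocab
```

These raw vectors would then be compressed by the pre-trained autoencoder into the initial feature representation.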
In step three, the intention mining and feature-representation fusion part mainly solves the text representation difference problem; after the text data is feature-encoded, fusing the intention information is the key. The initial feature representation obtained in step two is matrix-multiplied with itself to obtain a similarity matrix that aggregates all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the intention information is fused by minimizing this loss.
In FIG. 2, X denotes the original data samples and Z denotes the feature representation obtained after the original feature distribution passes through the Encoder module. The invention constructs an IMA module to mine and fuse the user's intention information, performing fused intention encoding with the intention matrix R obtained in step one; the technical principle is shown in FIG. 3.
In FIG. 3, Z is the feature-vector representation obtained from the Encoder part, of size n × d, where n is the size of the data set and d = 10. The invention multiplies Z by its own transpose to obtain an n × n matrix W, i.e., the similarity matrix between samples in the data set. Two thresholds, _up and _down, are then set by a normalization algorithm to normalize the similarity matrix, obtaining a new matrix S whose values follow the rule below:
the model intention matrix R designed by the invention continuously optimizes the matrix S and recalls the Encoder part. The Loss function algorithm applied here is Similarity Loss, which can jointly measure the self-Similarity and relative Similarity of the sample pairs, so that it can optimize the correlation coding between the sample pairs through iteration. And (5) finely adjusting the encoder in the second step by minimizing the loss, and finally obtaining the text feature representation fused with the semantic information of the user intention. The feature representation distribution output by this step is used for the subsequent clustering process.
In step four, the distribution Q of the text vectors is obtained from step three, and to give the distribution higher confidence, a distribution P is further computed from Q. The method sets a clustering loss function: the divergence loss between the two distributions is computed with the KL divergence formula, and minimizing this loss helps the model learn a high-confidence distribution, refining the model parameters and cluster centroids and thereby yielding the clustering pseudo-label result.
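The patent names Q, P, and the KL loss but not their formulas; the sketch below assumes the standard DEC-style construction (Student's t soft assignment for Q and a squared-and-renormalized target P), which is the usual choice for this step and may differ from the patent's exact definitions.

```python
import numpy as np

def student_t_assignment(Z, centroids, alpha=1.0):
    """Soft cluster assignment Q via a Student's t kernel over feature-centroid distances."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Sharpened target P computed from Q to raise assignment confidence."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(P, Q, eps=1e-12):
    """KL(P || Q): the auxiliary loss minimized to refine the encoder and centroids."""
    return float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

The pseudo-label for each sample would then be the argmax of its row of Q.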
In step five, a label information matrix of size n × n is constructed from the pseudo-labels obtained in step four, and a new optimization function is constructed to compute the loss between this matrix and the intention information matrix. This loss is minimized to optimize and guide the clustering process.
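The label information matrix of step five can be sketched as follows; the function name is illustrative, and the 1/0 same-cluster encoding is an assumption chosen to mirror the intention matrix's structure.

```python
import numpy as np

def label_matrix(labels):
    """Build the n x n label information matrix from cluster pseudo-labels:
    entry (i, j) is 1 when samples i and j received the same pseudo-label, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(np.int8)
```

Comparing this matrix with the intention matrix entry-by-entry exposes which constrained pairs the current clustering violates.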
The purpose of intention-guided clustering is to solve the weak-supervision problem: to find clusters that satisfy the given constraints to the greatest possible extent while taking the user's intention into account. The guidance strength is enhanced by learning the similarity relationships between constraint information pairs, and after continuous iteration the clustering result is optimized along a specific direction. To this end, the invention sets an optimization function of the following form:
where y_i represents the cluster to which sample x_i is assigned and y_j represents the cluster to which sample x_j is assigned; Must-link means that the two sample points must belong to the same class, and Cannot-link means that the two sample points must not belong to the same class. For the two constraint relations, the following pairing cost formulas are set:
Same-pair (must-link) constraint pairing cost:
L(X_p, X_q)^+ = D_KL(P* ∥ Q) + D_KL(Q* ∥ P)
Different-pair (cannot-link) constraint pairing cost:
L(X_p, X_q)^- = L_n(D_KL(P* ∥ Q), σ) + L_n(D_KL(Q* ∥ P), σ)
L_n(e, σ) = max(0, σ - e)
where σ is a parameter set to prevent overfitting. Minimizing the loss L_IDC by iteration finally yields the optimal clustering result, achieving the aim of guiding the clustering process with constraint information and obtaining a text clustering result that incorporates the user's intention.
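The two pairing costs above can be sketched directly. The patent's extract does not define P*, Q* versus P, Q; the sketch assumes they are the assignment distributions of the two samples in the pair, which is a labeled assumption rather than the patent's definition.

```python
import math

def kl_div(p, q, eps=1e-12):
    """Discrete KL divergence between two assignment distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def hinge(e, sigma):
    """L_n(e, sigma) = max(0, sigma - e); sigma guards against overfitting."""
    return max(0.0, sigma - e)

def must_link_cost(p_star, q, q_star, p):
    """Same-pair cost L(X_p, X_q)^+: pull the pair's distributions together."""
    return kl_div(p_star, q) + kl_div(q_star, p)

def cannot_link_cost(p_star, q, q_star, p, sigma=1.0):
    """Different-pair cost L(X_p, X_q)^-: push the pair's divergence above the margin sigma."""
    return hinge(kl_div(p_star, q), sigma) + hinge(kl_div(q_star, p), sigma)
```

A must-link pair with identical distributions costs nothing, while the same pair under a cannot-link constraint pays the full margin — exactly the opposing pressures the optimization function encodes.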
In conclusion, the deep semi-supervised text clustering method combined with the user intention has excellent performance.
The invention provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the steps of the deep semi-supervised text clustering method combined with the user intention when executing the computer program.
The invention proposes a computer readable storage medium for storing computer instructions which, when executed by a processor, implement the steps of the method for deep semi-supervised text clustering in combination with user intent.
The method, device and medium for deep semi-supervised text clustering combining user intention have been introduced in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (9)
1. A deep semi-supervised text clustering method combining user intention, characterized by comprising the following steps:
step one: processing the pairwise constraint information given by the user into an intention matrix;
step two: learning an initial feature representation of the text with a pre-trained deep autoencoder;
step three: applying similarity normalization to the initial feature representation, computing a fitting loss against the intention matrix, and continuously back-propagating to adjust and optimize the encoder parameters to obtain the final feature representation;
step four: clustering the obtained feature vectors using KL divergence to obtain text clustering pseudo-labels;
step five: computing a loss function, namely the optimization function, on the obtained pseudo-labels using the intention information matrix, and iteratively optimizing step three to obtain a text clustering result that finally accords with the user's intention.
2. The method according to claim 1, wherein in the first step, association relationship between data points is mined according to paired constraint information given by a user, so as to construct an intention information matrix with size n x n, wherein n is data set size.
3. The method according to claim 2, characterized in that in step two, the text is first represented as vectors, and in the vectorization process one of the following mappings is selected: term frequency (TF), term frequency-inverse document frequency (TF-IDF), or Word2Vec.
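As one of the mapping options named in the claim, a minimal TF-IDF vectorization of tokenized documents can be sketched as follows; the smoothed IDF variant (log(N/df) + 1) is an assumption, since the claim names the scheme but not the exact formula:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map tokenized documents to TF-IDF vectors.

    TF: term count normalized by document length.
    IDF: log(N / df) + 1 smoothing, a common variant.
    """
    vocab = sorted({t for d in docs for t in d})
    index = {t: k for k, t in enumerate(vocab)}
    N = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = [0.0] * len(vocab)
        for t, c in tf.items():
            v[index[t]] = (c / len(d)) * (math.log(N / df[t]) + 1.0)
        vecs.append(v)
    return vocab, vecs

docs = [["deep", "text", "clustering"],
        ["text", "clustering", "intention"],
        ["deep", "neural", "network"]]
vocab, vecs = tfidf_vectors(docs)
```

Terms occurring in fewer documents (e.g. "intention") receive a higher weight than terms occurring in many (e.g. "text"), which is the intended discriminative effect of the inverse document frequency factor.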
4. The method according to claim 3, characterized in that in step three, the initial feature representation obtained in step two is multiplied by its transpose to obtain a similarity matrix covering all text data; a similarity loss is computed between the similarity matrix and the intention information matrix; and the encoder of step two is fine-tuned by minimizing this similarity loss, finally yielding a text feature representation fused with the semantic information of the user's intention.
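A minimal sketch of the similarity-loss computation in this claim, assuming L2-normalized features (so the matrix product gives cosine similarities) and a mean-squared loss restricted to constrained pairs — the claim names the loss but not its exact form:

```python
import numpy as np

def similarity_loss(Z, A):
    """Similarity loss between feature similarities and the intention matrix.

    Z: n x d initial feature matrix; rows are L2-normalized so that
    Z @ Z.T yields cosine similarities in [-1, 1].
    The loss is the mean squared difference on entries where the user
    expressed an intention (A != 0); the exact loss form is an
    assumption, not fixed by the claim.
    """
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    S = Zn @ Zn.T                      # n x n similarity matrix
    mask = A != 0
    return S, np.mean((S[mask] - A[mask]) ** 2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))            # toy initial features
A = np.eye(4)                          # intention matrix (1/-1/0 encoding)
A[0, 1] = A[1, 0] = 1.0
A[0, 2] = A[2, 0] = -1.0
S, loss = similarity_loss(Z, A)
```

In the method itself, this scalar loss would be back-propagated through the autoencoder to fine-tune its parameters, as the claim describes.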
5. The method according to claim 4, characterized in that in step four, step three yields a distribution Q over the text vectors; a distribution P with higher confidence is then computed from Q, the difference between the two distributions is measured with the KL divergence formula, and minimizing this loss helps the model learn high-confidence assignments, thereby refining the model parameters and cluster centroids and yielding the pseudo-label result of the clusters.
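The Q/P/KL machinery in this claim resembles the well-known deep embedded clustering recipe; a minimal sketch under that assumption (Student's t soft assignments for Q, a sharpened target for P — neither formula is spelled out in the claim):

```python
import numpy as np

def soft_assignments(Z, centroids, alpha=1.0):
    """Soft assignment Q of each feature vector to each cluster centroid,
    using a Student's t kernel (an assumption; the claim only names Q)."""
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Higher-confidence target P derived from Q by squaring and
    renormalizing, which sharpens confident assignments."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(P, Q):
    """KL(P || Q): the clustering loss minimized during refinement."""
    return float(np.sum(P * np.log(P / Q)))

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))          # toy text features
C = rng.normal(size=(2, 4))          # 2 cluster centroids
Q = soft_assignments(Z, C)
P = target_distribution(Q)
pseudo_labels = Q.argmax(axis=1)     # pseudo-label result of the clusters
```

Minimizing KL(P || Q) pulls each point toward the centroid it is already most confident about, which is how the loss "helps the model learn high-confidence distributions" in the claim's wording.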
6. The method according to claim 5, characterized in that in step five, an n x n label information matrix is constructed from the pseudo labels obtained in step four; a new optimization function is constructed to compute the loss between the label information matrix and the intention information matrix, and minimizing this loss optimizes and guides the clustering process; through iteration, the optimal clustering result is finally obtained, so that the constraint information guides the clustering process and a text clustering result combining the user's intention is produced.
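A minimal sketch of the label-matrix construction and its comparison against the intention matrix. The 1/-1 label encoding and the disagreement-count loss are illustrative assumptions; the claim's actual optimization function (claim 7) is not reproduced here:

```python
import numpy as np

def label_matrix(labels):
    """n x n matrix: 1.0 if two texts share a pseudo label, else -1.0
    (an assumed encoding, mirroring the intention matrix)."""
    return np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

def intention_loss(L, A):
    """Count of constrained pairs whose pseudo labels contradict the
    user's intention (A: 1 must-link, -1 cannot-link, 0 none).
    Illustrative only; not the patent's optimization function."""
    mask = A != 0
    return float(np.sum(L[mask] != A[mask]))

labels = np.array([0, 0, 1, 1])        # pseudo labels from step four
A = np.eye(4)                          # intention matrix
A[0, 1] = A[1, 0] = 1.0                # texts 0 and 1: must-link
A[0, 2] = A[2, 0] = -1.0               # texts 0 and 2: cannot-link
L = label_matrix(labels)
loss = intention_loss(L, A)
```

Here both constraints are satisfied by the pseudo labels, so the disagreement loss is zero; in the iterative scheme of the claim, a nonzero loss would feed back into step three to pull the clustering toward the user's intention.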
7. The method of claim 6, wherein the optimization function is of the form:
8. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 1-7 when executing the computer program.
9. A computer-readable storage medium storing computer instructions, which when executed by a processor, perform the steps of the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210208434.5A CN114661903B (en) | 2022-03-03 | 2022-03-03 | Deep semi-supervised text clustering method, device and medium combining user intention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661903A true CN114661903A (en) | 2022-06-24 |
CN114661903B CN114661903B (en) | 2024-07-09 |
Family
ID=82027540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210208434.5A Active CN114661903B (en) | 2022-03-03 | 2022-03-03 | Deep semi-supervised text clustering method, device and medium combining user intention |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661903B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116049697A (en) * | 2023-01-10 | 2023-05-02 | 苏州科技大学 | Interactive clustering quality improving method based on user intention learning |
CN117875318A (en) * | 2023-02-27 | 2024-04-12 | 同心县启胜新能源科技有限公司 | Temperature and humidity control method and system for livestock breeding based on Internet of things and cloud platform |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170300564A1 (en) * | 2016-04-19 | 2017-10-19 | Sprinklr, Inc. | Clustering for social media data |
CN110309302A (en) * | 2019-05-17 | 2019-10-08 | 江苏大学 | A kind of uneven file classification method and system of combination SVM and semi-supervised clustering |
CN110516068A (en) * | 2019-08-23 | 2019-11-29 | 贵州大学 | A kind of various dimensions Text Clustering Method based on metric learning |
US20200074280A1 (en) * | 2018-08-28 | 2020-03-05 | Apple Inc. | Semi-supervised learning using clustering as an additional constraint |
CN111046907A (en) * | 2019-11-02 | 2020-04-21 | 国网天津市电力公司 | Semi-supervised convolutional network embedding method based on multi-head attention mechanism |
Non-Patent Citations (1)
Title |
---|
ZHONG Jiang; LIU Longhai; LIANG Chuanwei: "Active semi-supervised text clustering based on pairwise constraints", Computer Engineering (计算机工程), no. 13, 5 July 2011 (2011-07-05) *
Also Published As
Publication number | Publication date |
---|---|
CN114661903B (en) | 2024-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Long et al. | Sentiment analysis of text based on bidirectional LSTM with multi-head attention | |
CN111444340B (en) | Text classification method, device, equipment and storage medium | |
CN113204952B (en) | Multi-intention and semantic slot joint identification method based on cluster pre-analysis | |
CN106383877B (en) | Social media online short text clustering and topic detection method | |
CN114661903B (en) | Deep semi-supervised text clustering method, device and medium combining user intention | |
CN111325029B (en) | Text similarity calculation method based on deep learning integrated model | |
CN109344399B (en) | Text similarity calculation method based on stacked bidirectional lstm neural network | |
CN110619051B (en) | Question sentence classification method, device, electronic equipment and storage medium | |
CN107085581A (en) | Short text classification method and device | |
CN109933792B (en) | Viewpoint type problem reading and understanding method based on multilayer bidirectional LSTM and verification model | |
CN113392191B (en) | Text matching method and device based on multi-dimensional semantic joint learning | |
CN113672718A (en) | Dialog intention recognition method and system based on feature matching and field self-adaption | |
CN114358109A (en) | Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment | |
CN114282059A (en) | Video retrieval method, device, equipment and storage medium | |
CN116775497A (en) | Database test case generation demand description coding method | |
CN113032556A (en) | Method for forming user portrait based on natural language processing | |
CN112446405A (en) | User intention guiding method for home appliance customer service and intelligent home appliance | |
CN116167833B (en) | Internet financial risk control system and method based on federal learning | |
CN117216012A (en) | Theme modeling method, apparatus, electronic device, and computer-readable storage medium | |
Jing et al. | Chinese text sentiment analysis based on transformer model | |
CN114398903B (en) | Intention recognition method, device, electronic equipment and storage medium | |
CN115526174A (en) | Deep learning model fusion method for finance and economics text emotional tendency classification | |
CN115906845A (en) | E-commerce commodity title naming entity identification method | |
CN113204971B (en) | Scene self-adaptive Attention multi-intention recognition method based on deep learning | |
CN110516068B (en) | Multi-dimensional text clustering method based on metric learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||