CN114742018A - Contrastive learning hierarchical coding text clustering method and system based on adversarial training - Google Patents

Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Info

Publication number
CN114742018A
Authority
CN
China
Prior art keywords
training
encoder
clustering
text
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210646870.0A
Other languages
Chinese (zh)
Inventor
郭湘
江岭
黄鹏
郭涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd
Priority to CN202210646870.0A
Publication of CN114742018A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a contrastive learning hierarchical coding text clustering method and system based on adversarial training. The method comprises the following steps: using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning; adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to the loss function; and clustering the text vectors output by the encoder using the Infomap algorithm. Based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect.

Description

Contrastive learning hierarchical coding text clustering method and system based on adversarial training
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a contrastive learning hierarchical coding text clustering method and system based on adversarial training.
Background
In current intelligent dialogue systems, whether chit-chat or task-oriented, a huge number of samples must be labeled manually to determine the intention of each sample, and the samples are then used to train a deep learning model for intention recognition. However, this approach has at least the following drawbacks: (1) labeling datasets of up to tens of millions of samples consumes enormous human resources; (2) manual labeling inevitably introduces labeling errors, and once a label is wrong, it brings uncertainty to model training on the one hand, and on the other hand requires reworking and re-checking of the mislabeled samples, further aggravating the waste of labor.
Disclosure of Invention
The invention aims to provide a contrastive learning hierarchical coding text clustering method and system based on adversarial training, which obtains better text vector representations through contrastive learning and adversarial training, constructs an undirected graph from the vector representations, and clusters the undirected graph with the Infomap algorithm, so that similar samples in massive unlabeled texts are clustered, samples with the same intention are divided into one cluster, and the intention of each cluster can be found, thereby solving the problems pointed out in the background art.
The embodiment of the invention is realized by the following technical scheme: a contrastive learning hierarchical coding text clustering method based on adversarial training comprises the following steps:
S1, using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning;
S2, adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to the loss function;
and S3, clustering the text vectors output by the encoder using the Infomap algorithm.
According to a preferred embodiment, the encoder employs one of BERT, RoBERTa, TinyBERT, or BERT-wwm.
According to a preferred embodiment, if the encoder uses BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction.
According to a preferred embodiment, the training set uses unlabeled similar sentences.
According to a preferred embodiment, the adversarial training employs one of FGM, PGD, or FreeLB.
According to a preferred embodiment, the adversarial training employs FGM, and step S2 specifically includes:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
According to a preferred embodiment, the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient.
According to a preferred embodiment, step S3 specifically includes:
expressing each text vector as a node, and constructing an undirected graph of the text representations by computing the cosine similarity between nodes;
taking the similarity between nodes as the random-walk probability, performing random walks on the undirected graph, and constructing Huffman codes directly from the random-walk probabilities;
encoding each node and each cluster according to the Huffman coding;
and computing the shortest average coding length of nodes and clusters respectively with a greedy algorithm to complete the clustering.
The invention also provides a contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the above method, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects: (1) the method and the system obtain better text vector representations by combining contrastive learning with adversarial training, so that the samples are distributed as uniformly as possible over the hypersphere surface; (2) based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect; (3) the number of clusters does not need to be determined in advance, which reduces the extra manual parameter-tuning workload caused by the uncertainty of different sample sizes and numbers of categories; (4) samples are clustered automatically without manual labeling, and annotators can label the data in batches according to the clustering result, saving workload; (5) the method and the system provided by the invention can also be used to identify bad cases in existing sentence samples and to judge the coarseness of semantic granularity.
Drawings
Fig. 1 is a schematic flowchart of the contrastive learning hierarchical coding text clustering method based on adversarial training according to embodiment 1 of the present invention;
Fig. 2 is a model framework diagram provided in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Embodiment 1
The applicant has found through research that the traditional approach of determining the intention of each sample by manual labeling has at least the following defects: (1) labeling datasets of up to tens of millions of samples consumes enormous human resources; (2) manual labeling inevitably introduces labeling errors, and once a label is wrong, it brings uncertainty to model training on the one hand, and on the other hand requires reworking and re-checking of the mislabeled samples, further aggravating the waste of labor.
Therefore, the embodiment of the invention provides a contrastive learning hierarchical coding text clustering method based on adversarial training, which obtains better text vector representations through contrastive learning and adversarial training, constructs an undirected graph from the vector representations, and clusters the undirected graph with the Infomap algorithm; in this way, similar samples in massive unlabeled texts can be clustered, samples with the same intention can be divided into one cluster, and the intention of each cluster can be found. The specific scheme is as follows:
referring to fig. 1, the contrast learning hierarchical coding text clustering method based on confrontation training mainly includes two steps, a text vector representing step and an Infomap clustering step, which are described in detail below.
It should be noted that Infomap has rarely been used for text clustering because, without good text vector representations, it cannot achieve a good clustering effect, and existing contrastive learning vector representations are not sufficient: they are not distributed as uniformly as possible over the hypersphere surface.
Based on this, in one implementation of the embodiment of the present invention, the text vector representation step includes:
referring to fig. 2, a comparison learning model of a Bert frame is used as an encoder, and a batch of training sets are input into the encoder for reconstruction learning, where the training sets of this embodiment use unlabeled similar sentences. It should be noted that, the encoder described above is implemented based on the department of self-supervision in Bert, and the principle of Bert is described in detail below:
and performing data enhancement on the input to construct a positive sample, wherein the data enhancement mode comprises one of synonym replacement, sentence truncation, reverse translation, punctuation mark addition, unimportant word deletion and word order rearrangement. It should be noted that synonym replacement is added, so that the model can improve the score of synonym similarity, and further improve the text matching capability of the model.
Further, self-supervised contrastive learning is performed, specifically as follows: the input $x_i$ and its positive sample $x_i^{+}$ are fed into the encoder for fitting training to obtain two representation vectors $h_i$ and $h_i^{+}$, which form a positive pair; another input $x_j$ is randomly sampled within the batch to serve as a negative example of $x_i$. Further, the cosine similarity between the representation vector and the other vectors in the batch is computed, the computed cosine similarity is used as the matching score to rank the candidate texts, and the final loss of the model is obtained through a softmax function and cross entropy; the final loss function is expressed as follows:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathbb{1}_{[i=j]}\,\log \frac{e^{\operatorname{sim}(h_i,\,h_j^{+})/\tau}}{\sum_{k=1}^{N} e^{\operatorname{sim}(h_i,\,h_k^{+})/\tau}}$$
In the above formula, $\mathbb{1}_{[i=j]}$ denotes an indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity, $h_i^{+}$ denotes the representation vector of the positive sample $x_i^{+}$ of $x_i$, $h_j^{+}$ (for $j \neq i$) denotes the representation vector of an in-batch negative sample of $x_i$, and $\theta$ denotes the trainable parameters of the model.
Further, the network parameters are trained iteratively based on the final loss function. In this embodiment, the temperature hyperparameter $\tau$ is set to 0.05 and the batch size $N$ is set to 128; experiments show that with this setting the model can be fitted in only one epoch, and the method obtains high similarity scores for similar sentences of different lengths.
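As a concrete illustration of this training objective, the sketch below shows how the encoder with an MLP dimensionality-reduction head and the softmax-plus-cross-entropy loss over in-batch cosine similarities could be implemented. It assumes PyTorch and the Hugging Face transformers library; the class name, the checkpoint `bert-base-chinese`, and the MLP layer sizes are illustrative assumptions rather than values given in the patent.

```python
import torch
import torch.nn.functional as F
from transformers import BertModel

class ContrastiveEncoder(torch.nn.Module):
    """BERT encoder followed by an MLP head for dimensionality reduction (sketch)."""
    def __init__(self, name: str = "bert-base-chinese", out_dim: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.mlp = torch.nn.Sequential(          # dimensionality-reduction head
            torch.nn.Linear(self.bert.config.hidden_size, 256),
            torch.nn.Tanh(),
            torch.nn.Linear(256, out_dim),
        )

    def forward(self, **inputs):
        cls = self.bert(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.mlp(cls)

def contrastive_loss(h: torch.Tensor, h_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Softmax + cross-entropy over the in-batch cosine-similarity matrix.

    h, h_pos: (N, d) representations of the inputs and their positive samples;
    for row i, the other rows of h_pos serve as in-batch negatives.
    """
    sim = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    labels = torch.arange(h.size(0), device=h.device)  # target j = i, i.e. the indicator 1[i=j]
    return F.cross_entropy(sim, labels)
```

In this sketch the batch size and temperature follow the embodiment (N = 128, τ = 0.05), and fitting within a single epoch is assumed, as stated above.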
It should be noted that, besides the BERT illustrated above, the encoder may also employ one of RoBERTa, TinyBERT, or BERT-wwm; if the encoder adopts BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction, which is not repeated here.
Furthermore, adversarial training is added to the training process of the encoder, and the training of the encoder is guided according to the loss function. In one implementation of the embodiment of the present invention, the adversarial training employs FGM, with the following specific steps:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
During the whole model training process, the forward pass is computed for both $x$ and $x+\Delta x$ to obtain the final Loss; during back-propagation the gradients with respect to $x$ and $x+\Delta x$ are computed respectively before the gradient descent step. Training in this manner takes roughly twice as long as training without the perturbation term.
Wherein the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient. Note that $\Delta x$ is taken along the direction of gradient ascent, so the Loss computed on $x+\Delta x$ increases significantly.
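The FGM step described above can be sketched as follows. This is a generic PyTorch implementation of the Fast Gradient Method applied to the word-embedding weights; the class name, the default `epsilon`, and the `emb_name` filter are illustrative assumptions, not details specified in the patent.

```python
import torch

class FGM:
    """Fast Gradient Method: perturb the embedding weights along the gradient direction."""
    def __init__(self, model: torch.nn.Module, epsilon: float = 1.0, emb_name: str = "word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        # Add the perturbation Δx = ε · g / ‖g‖ to the embedding parameters.
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)            # the norm ("modal length") of the gradient
                if norm != 0:
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self):
        # Remove the perturbation, restoring the original embedding weights.
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# One training step (sketch): forward/backward on x, then on x + Δx, then update.
#   loss = contrastive_loss(model(**batch), model(**batch_pos)); loss.backward()
#   fgm.attack()
#   loss_adv = contrastive_loss(model(**batch), model(**batch_pos)); loss_adv.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
```

The two forward/backward passes in the commented training step correspond to the roughly doubled training time mentioned above.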
It should be noted that, in addition to the above-mentioned FGM, one of PGD and FreeLB may also be used for the adversarial training, which is not repeated here.
In one implementation of the embodiment of the present invention, the Infomap clustering step includes:
clustering the text vectors output by the encoder using the Infomap algorithm, specifically as follows:
the method includes the steps that a text vector represents a node, an undirected graph represented by the text is constructed by calculating cosine similarity between the nodes, and it is noted that the edge construction can be carried out only by determining threshold values in advance, for example, setting threshold =0.5, that is, the edge construction and composition can be carried out on two nodes only when the cosine similarity between the two nodes is greater than or equal to 0.5. Further, taking the similarity between the nodes as the probability of random walk, carrying out random walk on the undirected graph, and directly constructing a huffman code according to the probability of the random walk; coding each node according to the huffman coding, and coding each cluster according to the huffman coding; and respectively calculating the shortest average coding length of the nodes and the clusters by using a greedy algorithm to finish clustering. It is understood that the above process does not require the confirmation of the number of clusters, the number of clusters can be automatically confirmed according to the hierarchical coding, and the above method has high efficiency and interpretability.
The embodiment of the invention also provides a contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the above method, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
In summary, the technical solution of the embodiment of the present invention has at least the following advantages and beneficial effects: (1) the method and the system obtain better text vector representations by combining contrastive learning with adversarial training, so that the samples are distributed as uniformly as possible over the hypersphere surface; (2) based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect; (3) the number of clusters does not need to be determined in advance, which reduces the extra manual parameter-tuning workload caused by the uncertainty of different sample sizes and numbers of categories; (4) samples are clustered automatically without manual labeling, and annotators can label the data in batches according to the clustering result, saving workload; (5) the method and the system provided by the invention can also be used to identify bad cases in existing sentence samples and to judge the coarseness of semantic granularity.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP, a dynamic programming language such as Python, Ruby, or Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), in a cloud computing environment, or as a service, such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as requiring more features than are expressly recited in each claim. Indeed, the embodiments may be characterized as having fewer than all of the features of a single embodiment disclosed above.
The present invention has been described in terms of the preferred embodiment, and it is not intended to be limited to the embodiment. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A contrastive learning hierarchical coding text clustering method based on adversarial training, characterized by comprising the following steps:
S1, using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning;
S2, adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to a loss function;
and S3, clustering the text vectors output by the encoder using the Infomap algorithm.
2. The method as claimed in claim 1, wherein the encoder employs one of BERT, RoBERTa, TinyBERT, or BERT-wwm.
3. The method as claimed in claim 2, wherein if the encoder uses BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction.
4. The method of claim 1, wherein the training set employs unlabeled similar sentences.
5. The method as claimed in claim 1, wherein the adversarial training employs one of FGM, PGD, or FreeLB.
6. The method as claimed in claim 1, wherein the adversarial training employs FGM, and step S2 specifically includes:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
7. The contrastive learning hierarchical coding text clustering method based on adversarial training as claimed in claim 6, wherein the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient.
8. The contrastive learning hierarchical coding text clustering method based on adversarial training as claimed in claim 1, wherein step S3 specifically comprises:
expressing each text vector as a node, and constructing an undirected graph of the text representations by computing the cosine similarity between nodes;
taking the similarity between nodes as the random-walk probability, performing random walks on the undirected graph, and constructing Huffman codes directly from the random-walk probabilities;
encoding each node and each cluster according to the Huffman coding;
and computing the shortest average coding length of nodes and clusters respectively with a greedy algorithm to complete the clustering.
9. A contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the method of any one of claims 1 to 8, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
CN202210646870.0A 2022-06-09 2022-06-09 Contrastive learning hierarchical coding text clustering method and system based on adversarial training Pending CN114742018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210646870.0A CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210646870.0A CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Publications (1)

Publication Number Publication Date
CN114742018A true CN114742018A (en) 2022-07-12

Family

ID=82287966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210646870.0A Pending CN114742018A (en) 2022-06-09 2022-06-09 Contrast learning level coding text clustering method and system based on confrontation training

Country Status (1)

Country Link
CN (1) CN114742018A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952852A (en) * 2022-12-20 2023-04-11 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113761192A (en) * 2021-05-18 2021-12-07 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN114003698A (en) * 2021-12-27 2022-02-01 成都晓多科技有限公司 Text retrieval method, system, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761192A (en) * 2021-05-18 2021-12-07 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN114003698A (en) * 2021-12-27 2022-02-01 成都晓多科技有限公司 Text retrieval method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANIELA N. RIM et al.: "Adversarial Training with Contrastive Learning in NLP", arXiv:2109.09075v1 *
TRIANA DEWI SALMA et al.: "Text Classification Using XLNet with Infomap Automatic Labeling Process", 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952852A (en) * 2022-12-20 2023-04-11 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium
CN115952852B (en) * 2022-12-20 2024-03-12 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110737758B (en) Method and apparatus for generating a model
CN111444340B (en) Text classification method, device, equipment and storage medium
CN113505244B (en) Knowledge graph construction method, system, equipment and medium based on deep learning
CN112256860B (en) Semantic retrieval method, system, equipment and storage medium for customer service dialogue content
CN114003698B (en) Text retrieval method, system, equipment and storage medium
CN112528677B (en) Training method and device of semantic vector extraction model and electronic equipment
CN112380863A (en) Sequence labeling method based on multi-head self-attention mechanism
CN111666500A (en) Training method of text classification model and related equipment
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN113268560A (en) Method and device for text matching
CN112528654A (en) Natural language processing method and device and electronic equipment
CN113987174A (en) Core statement extraction method, system, equipment and storage medium for classification label
CN113761190A (en) Text recognition method and device, computer readable medium and electronic equipment
CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training
CN115203419A (en) Language model training method and device and electronic equipment
CN111597807A (en) Method, device and equipment for generating word segmentation data set and storage medium thereof
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN115129826B (en) Electric power field model pre-training method, fine tuning method, device and equipment
CN116680575A (en) Model processing method, device, equipment and storage medium
CN114239583B (en) Method, device, equipment and medium for training entity chain finger model and entity chain finger
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
Bauer et al. Optimizing for measure of performance in max-margin parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220712