CN113918714A - Classification model training method, clustering method and electronic equipment - Google Patents

Classification model training method, clustering method and electronic equipment

Info

Publication number: CN113918714A
Application number: CN202111150725.5A
Authority: CN (China)
Prior art keywords: text data, classification model, target, label, data set
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 曹宜超
Current Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning


Abstract

The disclosure provides a classification model training method, a clustering method and electronic equipment, relating to the technical field of artificial intelligence, in particular to big data, deep learning and intention recognition. The scheme is as follows: a first unlabeled text data set is acquired and input into a first classification model for iterative training to obtain a target classification model. During the iterative training, the loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, both of which are obtained based on the first classification model during the iterative training. Because the labels used to compute the loss are produced by the model itself, model training is realized without manually annotated data, and the model training effect can be improved.

Description

Classification model training method, clustering method and electronic equipment
Technical Field
The disclosure relates to the field of artificial intelligence technologies such as big data, deep learning and intention recognition, and in particular to a classification model training method, a clustering method and electronic equipment.
Background
With the development of artificial intelligence, intelligent technologies such as spatio-temporal big data perception and artificial-intelligence customer service are increasingly woven into daily life. Whether for intention recognition in spatio-temporal big data or in the multi-turn dialogue scenarios of artificial-intelligence customer service, the input must first undergo intention recognition before the relevant operation can be carried out. In text intention processing, intention clustering is one of the important steps.
Clustering often adopts a semi-supervised approach, in which a model must first be trained, so model training is itself an important step. Conventionally, the model is trained by feeding it labeled text data that has been annotated in advance.
Disclosure of Invention
The disclosure provides a classification model training method, a clustering method and electronic equipment.
In a first aspect, an embodiment of the present disclosure provides a classification model training method, where the method includes:
acquiring a first unlabeled text data set;
inputting the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In the classification model training method of this embodiment, the first classification model is iteratively trained on the first unlabeled text data set. During the iterative training, the loss value of the first classification model is calculated from the predicted intention category label and the reference intention category label of the first unlabeled text data set, and both labels are obtained based on the first classification model itself during the iterative training. Because the loss can be computed from labels the model produces, the model can be trained without pre-annotated data, which improves the model training effect.
In a second aspect, an embodiment of the present disclosure provides a clustering method, where the method includes:
performing feature extraction on a text data set to be clustered by using a target language model in a target classification model to obtain a feature vector set of the text data set to be clustered;
performing intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, where N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
In a third aspect, an embodiment of the present disclosure provides a classification model training apparatus, including:
the first acquisition module is used for acquiring a first unlabeled text data set;
the first training module is used for inputting the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In a fourth aspect, an embodiment of the present disclosure provides a clustering apparatus, where the apparatus includes:
the feature extraction module is used for performing feature extraction on a text data set to be clustered by using a target language model in a target classification model to obtain a feature vector set of the text data set to be clustered;
the clustering module is used for performing intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, where N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
In a fifth aspect, an embodiment of the present disclosure further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the classification model training method of the present disclosure as provided in the first aspect.
In a sixth aspect, an embodiment of the present disclosure further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the clustering method of the present disclosure as provided in the second aspect.
In a seventh aspect, an embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the classification model training method provided in the first aspect of the present disclosure or the clustering method provided in the second aspect of the present disclosure.
In an eighth aspect, an embodiment of the present disclosure provides a computer program product, which includes a computer program that, when executed by a processor, implements the classification model training method provided in the first aspect or the clustering method provided in the second aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow chart diagram of a classification model training method according to an embodiment provided by the present disclosure;
FIG. 2 is a flow chart diagram of a clustering method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a classification model training method and a clustering method according to an embodiment provided by the present disclosure;
FIG. 4 is a schematic diagram of a first classification model obtained by training an initial classification model in a classification model training method according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a target classification model obtained by training a first classification model in a classification model training method according to an embodiment of the present disclosure;
FIG. 6 is a flow chart of generating pseudo labels by clustering in a classification model training method according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a classification model training apparatus according to an embodiment provided by the present disclosure;
FIG. 8 is a block diagram of a clustering device according to an embodiment provided by the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a classification model training method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As shown in fig. 1, according to an embodiment of the present disclosure, the present disclosure provides a classification model training method, including:
step S101: a first set of unlabeled text data is obtained.
An unlabeled text data set may be understood as a data set without intention category labels; it may include a plurality of unlabeled text data items for model training. By contrast, a labeled text data set is a data set with intention category labels, that is, it includes not only a plurality of labeled text data items but also the intention category label corresponding to each labeled text data item.
Step S102: inputting the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In this embodiment, the first classification model is trained with the first unlabeled text data set to obtain the target classification model, so that the target classification model can better learn the semantic knowledge of the first unlabeled text data set. The first classification model corresponds to a loss function; there are many types of loss function, and this embodiment does not limit which is used. Model training can be understood as driving the model to a convergence state by minimizing the loss function: while the model has not converged, its parameters are continually adjusted, output is obtained from the model again, and training iterates until the model converges and the loss function reaches a minimum. During the iterative training, each iteration may calculate the value of the loss function of the first classification model from the predicted intention category label that the first classification model assigns to the first unlabeled text data set and the reference intention category label obtained based on the first classification model. That is, in this embodiment the value of the loss function is calculated from the predicted and reference intention category labels, both obtained based on the first classification model during the iterative training, so training proceeds without manually annotating data in advance. It should be noted that the first classification model includes a first language model and a first classification layer connected in sequence; the target classification model may likewise include a target language model and a target classification layer connected in sequence, where the target language model is the trained first language model and the target classification layer is the trained first classification layer.
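For illustration only, the following is a minimal sketch of such a two-part classification model, a language model encoder followed by a classification layer, written in PyTorch. The class and method names, and the assumption of a HuggingFace-style encoder interface, are illustrative choices rather than the disclosure's reference implementation.

```python
from torch import nn


class IntentClassifier(nn.Module):
    """A language model followed by a classification layer, mirroring the
    first/target classification model structure described above."""

    def __init__(self, language_model: nn.Module, hidden_size: int, num_intents: int):
        super().__init__()
        self.language_model = language_model        # e.g. a BERT/ERNIE encoder
        self.classifier = nn.Linear(hidden_size, num_intents)

    def encode(self, input_ids, attention_mask):
        # Feature extraction: the encoder output serves as the feature vector
        # (here the [CLS] vector, assuming a HuggingFace-style encoder).
        out = self.language_model(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]

    def forward(self, input_ids, attention_mask):
        # Intention classification: logits over the intention categories.
        return self.classifier(self.encode(input_ids, attention_mask))
```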
In the classification model training method of this embodiment, the first classification model is iteratively trained on the first unlabeled text data set, and during the iterative training the loss value of the first classification model is calculated from the predicted intention category label and the reference intention category label of the first unlabeled text data set, both obtained based on the first classification model. Computing the loss from labels the model itself produces realizes the training of the model and improves the model training effect.
In one embodiment, the predicted intention category label is obtained by performing intention classification on the first unlabeled text data set with the first classification model; the reference intention category label is derived from the intention category labels of a plurality of clusters of the first unlabeled text data set; the clusters are obtained by clustering the first unlabeled text data set based on a feature vector set; and the feature vector set is obtained by performing feature extraction on the first unlabeled text data set with the first classification model.
That is, during the iterative training, the first unlabeled text data set is input into the first classification model, which outputs the predicted intention category labels and, through feature extraction on the input, yields the feature vector set of the first unlabeled text data set. The first unlabeled text data set is then clustered based on this feature vector set using a clustering algorithm (not limited in this embodiment), producing a plurality of clusters and the intention category label of each cluster. For each cluster, the cluster's intention category label is taken as the reference intention category label of every text data item in that cluster, so that each item in the first unlabeled text data set is assigned a corresponding reference intention category label. In each iteration, the loss value of the first classification model is calculated from the predicted and reference intention category labels obtained in that iteration. As the first classification model is continually updated (optimized) over the iterations, the feature vectors it generates for the first unlabeled text data set are continually updated, so the clustering result, the reference intention category labels, and the predicted intention category labels generated by the first classification model are all continually updated as well.
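A minimal sketch of this pseudo-labeling step follows. K-means and cross-entropy are stand-ins: the disclosure fixes neither the clustering algorithm nor the loss function, so both choices and all names here are assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans


def reference_labels_from_clustering(features: np.ndarray, num_intents: int) -> np.ndarray:
    """Cluster the feature vector set; every text inherits the intention
    category label (cluster id) of the cluster it falls into."""
    return KMeans(n_clusters=num_intents, n_init=10).fit_predict(features)


def classification_loss(logits: torch.Tensor, reference_labels: np.ndarray) -> torch.Tensor:
    """Loss between the model's predicted intention category labels (logits)
    and the clustering-derived reference labels."""
    targets = torch.as_tensor(reference_labels, dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, targets)
```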
In one embodiment, before inputting the first unlabeled text data set into the first classification model for iterative training to obtain the target classification model, the method further includes:
acquiring a second unlabeled text data set and a first labeled text data set;
inputting the second unlabeled text data set into a pre-trained language model for training to obtain an intermediate language model;
and constructing an initial classification model based on the intermediate language model and an initial classification layer, and inputting the first labeled text data set into the initial classification model for training to obtain the first classification model, where the first classification model includes a first language model and a first classification layer connected in sequence.
In this embodiment, the second unlabeled text data set is input into the pre-trained language model for training to obtain the intermediate language model; that is, training on the second unlabeled text data set continues from the pre-trained language model, enhancing the performance of the resulting intermediate language model. An initial classification model is then constructed from the intermediate language model and an initial classification layer, with the input of the initial classification layer connected to the output of the intermediate language model. The initial classification model is trained on the first labeled text data set to obtain the first classification model, where the first language model is the trained intermediate language model and the first classification layer is the trained initial classification layer. At this stage the prior knowledge in the first labeled text data set is injected into the classification model, enhancing the performance of the resulting first classification model. The first unlabeled text data set is then input into the first classification model for iterative training to obtain the target classification model. In this way the model training effect, and thus the performance of the resulting target classification model, can be further improved.
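A hedged sketch of these two preparatory stages is shown below; `IntentClassifier` refers to the sketch above, and the masked-language-model helper `mlm_loss`, the `optimizer_factory` and the data loaders are assumed placeholders rather than components named by the disclosure.

```python
import torch.nn.functional as F


def build_first_classification_model(pretrained_lm, mlm_loss, unlabeled_loader,
                                     labeled_loader, hidden_size, num_intents,
                                     optimizer_factory):
    # Stage 1: continue pre-training on in-domain unlabeled text (for example
    # with a masked-language-model objective) to get the intermediate language model.
    opt = optimizer_factory(pretrained_lm.parameters())
    for batch in unlabeled_loader:
        loss = mlm_loss(pretrained_lm, batch)     # assumed MLM helper
        opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: attach an initial classification layer and fine-tune on the
    # labeled set, injecting its prior intention knowledge into the model.
    model = IntentClassifier(pretrained_lm, hidden_size, num_intents)
    opt = optimizer_factory(model.parameters())
    for input_ids, attention_mask, labels in labeled_loader:
        logits = model(input_ids, attention_mask)
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()
    return model   # the first classification model
```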
It should be noted that the predicted intention category label is output by the first classification layer in the first classification model, and the feature vector set of the first unlabeled text data set is obtained by performing feature extraction on the first unlabeled text data set with the first language model in the first classification model.
As an example, the first unlabeled text data set, the second unlabeled text data set and the first labeled text data set may all be data sets of a target domain (a specific domain).
In one embodiment, during the (M+1)-th round of iterative training, the intention category labels of the plurality of clusters of the first unlabeled text data set are category labels aligned with the M-th intention category labels;
where the M-th intention category labels are the intention category labels of the plurality of clusters of the first unlabeled text data set during the M-th round of iterative training, and M is a positive integer.
Because the cluster numbers (intention category labels) produced by clustering may change between round M and round M+1, the Hungarian algorithm can be used to align the clusters generated by adjacent rounds of iteration, so that each cluster keeps a unique intention category label. It should be noted that, likewise, the intention category labels of the clusters in the M-th round are category labels aligned with those of the (M-1)-th round, where M-1 is a positive integer.
In this embodiment, aligning the cluster labels used in round M+1 with those of round M improves the accuracy of the intention category labels; since these labels are subsequently used to calculate the loss values that drive model training, the model training effect can be improved.
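For illustration, a possible alignment step using the Hungarian algorithm as provided by SciPy's `linear_sum_assignment`; measuring cluster correspondence with a contingency (overlap) matrix is an assumption, since the disclosure names only the algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def align_cluster_labels(prev_labels: np.ndarray, curr_labels: np.ndarray, k: int) -> np.ndarray:
    """Relabel the current round's clusters so that each keeps the id of the
    previous round's cluster it overlaps most with."""
    # overlap[i, j] = number of texts in previous cluster i and current cluster j.
    overlap = np.zeros((k, k), dtype=np.int64)
    for p, c in zip(prev_labels, curr_labels):
        overlap[p, c] += 1
    # Hungarian algorithm: maximize total overlap (minimize its negation).
    prev_ids, curr_ids = linear_sum_assignment(-overlap)
    mapping = {c: p for p, c in zip(prev_ids, curr_ids)}
    return np.array([mapping[c] for c in curr_labels])
```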
As shown in fig. 2, according to an embodiment of the present disclosure, the present disclosure further provides a clustering method, where the method includes:
step 201: and performing feature extraction on the text data set to be clustered by using a target language model in the target classification model to obtain a feature vector set of the text data set to be clustered.
The text data set to be clustered is a text data set to be clustered, and can be understood as a text data set to be clustered without labels.
Step 202: performing intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, where N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
It should be noted that the target classification model is obtained by training a first classification model with the first unlabeled text data set, and during that training the loss value used is calculated from the predicted intention category label and the reference intention category label of the first unlabeled text data set.
In the clustering method of this embodiment, because the loss value used to train the target classification model is calculated from the predicted and reference intention category labels of the first unlabeled text data set, feature extraction on the text data set to be clustered with the target language model in the target classification model yields a more accurate feature vector set. Performing intention clustering on the text data set to be clustered based on this feature vector set to obtain the N clusters therefore improves the clustering effect and the clustering accuracy.
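A minimal sketch of this prediction step, reusing the `encode` method from the classifier sketch above and assuming k-means as the clustering algorithm; the function and parameter names are illustrative.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans


def cluster_texts(model, batches, n_clusters: int) -> np.ndarray:
    """Extract features with the (target) language model, then intent-cluster
    the feature vector set into N clusters; each cluster id stands for one
    intention category."""
    feats = []
    with torch.no_grad():
        for input_ids, attention_mask in batches:
            feats.append(model.encode(input_ids, attention_mask).cpu().numpy())
    features = np.concatenate(feats)
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```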
In one embodiment, after performing intention clustering on the text data set to be clustered based on the feature vector set to obtain the N clusters, the method further includes:
performing word segmentation on the text data in a target cluster and filtering out preset segmented words to obtain a target segmented-word set of the target cluster, where the target cluster is any one of the N clusters;
calculating the term frequency-inverse document frequency (TF-IDF) value of each segmented word in the target segmented-word set;
determining candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in the target segmented-word set;
and filtering the text data in the target cluster according to the candidate keywords.
The preset segmented words may be understood as common stop words, which can be configured in advance; for example, they may include words such as "ones", "some" and "first". Each of the N clusters is processed as above to determine candidate keywords and filter its text data. It should be noted that the TF-IDF (term frequency-inverse document frequency) value is the product of the term frequency (TF) and the inverse document frequency (IDF).
In this embodiment, the TF-IDF value of each segmented word in the target segmented-word set of a target cluster is calculated, candidate keywords of the set are determined according to those TF-IDF values, and the text data in the target cluster is filtered according to the candidate keywords; that is, the cluster is denoised, which improves the accuracy of the cluster.
In one embodiment, determining the candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in it includes:
determining n segmented words in the target segmented-word set as the candidate keywords, where n is a positive integer and the TF-IDF value of each of the n segmented words is greater than the TF-IDF values of the remaining segmented words, the remaining segmented words being those in the target segmented-word set other than the n segmented words.
That is, in this embodiment the n segmented words with the highest TF-IDF values are taken as the candidate keywords of the cluster, improving the accuracy of the candidate keywords; the text data in the target cluster is then filtered according to these candidate keywords, improving the filtering accuracy and hence the accuracy of the cluster.
In one embodiment, filtering the text data in the target cluster according to the candidate keywords includes:
filtering out target text data in the target cluster, where the number of target keywords in the target text data is less than a preset number and the target keywords belong to the candidate keywords.
That is, if a text data item in the target cluster contains at least the preset number of the candidate keywords, it is kept in the target cluster; if it contains fewer than the preset number, it is filtered out and deleted from the target cluster. In this way the noise data of the target cluster is filtered out, improving the accuracy of the cluster.
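The sketch below strings these filtering steps together for one cluster: word segmentation, stop-word removal, TF-IDF ranking, top-n keyword selection and keyword-count filtering. The `segment` tokenizer (for example a jieba-style segmenter), the stop-word list and the thresholds `top_n` and `min_hits` are assumptions.

```python
import math
from collections import Counter


def filter_cluster(texts, segment, stopwords, top_n=10, min_hits=2):
    """Denoise one cluster by keeping only texts that contain at least
    `min_hits` of the cluster's top-n TF-IDF candidate keywords."""
    docs = [[w for w in segment(t) if w not in stopwords] for t in texts]

    # Term frequency over the cluster and document frequency per word.
    tf = Counter(w for doc in docs for w in doc)
    total = sum(tf.values())
    df = Counter(w for doc in docs for w in set(doc))
    n_docs = len(docs)

    # TF-IDF = TF * IDF, with the usual +1 smoothing in the IDF denominator.
    tfidf = {w: (tf[w] / total) * math.log(n_docs / (1 + df[w])) for w in tf}
    keywords = set(sorted(tfidf, key=tfidf.get, reverse=True)[:top_n])

    # Keep a text only if it contains enough candidate keywords.
    return [t for t, doc in zip(texts, docs)
            if len(keywords.intersection(doc)) >= min_hits]
```

The synonym expansion described in the example that follows could be applied to `keywords` just before the final filtering step.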
As an example, before filtering the text data in the target cluster according to the candidate keywords, the method may further include: determining synonyms corresponding to the candidate keywords and adding the synonyms to the candidate keywords to update them. The text data in the target cluster is then filtered according to the updated candidate keywords, improving the filtering effect.
In one embodiment, the target classification model is trained by:
acquiring a first unlabeled text data set;
inputting the first unlabeled text data set into a first classification model for iterative training to obtain the target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In one embodiment, the predicted intention category label is obtained by performing intention classification on the first unlabeled text data set with the first classification model; the reference intention category label is derived from the intention category labels of a plurality of clusters of the first unlabeled text data set; the clusters are obtained by clustering the first unlabeled text data set based on its feature vector set; and that feature vector set is obtained by performing feature extraction on the first unlabeled text data set with the first classification model.
It should be noted that the target classification model used in the clustering method in the present disclosure may be trained according to the classification model training method in the foregoing embodiment, and details are not repeated here.
The above-mentioned classification model training method and clustering method are specifically described in an embodiment below.
The method provided by the disclosure mainly combines a large-scale pre-trained language model with a clustering algorithm and is divided into four stages. Stage 1 performs in-domain pre-training of the pre-trained language model on a large amount of unlabeled in-domain data from a specific business domain; this lets the model learn a large amount of domain knowledge, which benefits the downstream classification and clustering tasks. Stage 2 constructs an initial classification model based on the intermediate language model obtained from the in-domain training of stage 1 and trains it with the limited labeled data to obtain the first classification model; the purpose of this step is to inject the prior knowledge of the labeled data into the model to guide the subsequent clustering of intention topics. Stage 3 performs clustering with a large amount of unlabeled data and, following a self-supervision idea, uses the post-clustering pseudo labels (that is, the intention category labels of the clusters) to compute the value of the loss function of the first classification model; this value continually adjusts the first classification model so that the intermediate language model within it generates more reasonable feature vector representations, which in turn adjust the clusters. These first three stages are the training stages, shown by the solid lines in fig. 3, and they yield the target classification model. Stage 4 is the prediction stage, shown by the dotted lines in fig. 3: a large unlabeled data set to be clustered is input into the target classification model, features are extracted by the trained target language model, clustering is performed on the extracted feature vector set, and each cluster is filtered with keywords based on TF-IDF values to obtain the final cluster output, where each cluster represents a different intention category and contains all texts in the data set to be clustered that share the same intention category.
For stage 1: intra-domain pre-training:
with the development of large-scale pre-trained language models, most natural language processing tasks can be finely tuned on pre-trained language models, and although the simple fine tuning can achieve a good effect, further breakthrough cannot be made, so that a method for pre-training pre-trained models again by using data in the field becomes a simple and effective strategy. Generally speaking, pre-trained language models such as ERNIE and BERT are pre-trained on a large-scale open domain data set, and they learn general knowledge in the open domain, and although these a priori knowledge have a certain effect on intent clustering, it is often not good to directly use these pre-trained models to perform fine tuning in a specific domain under a specific business scenario. Therefore, in the method proposed by the present disclosure, pre-training of knowledge in the domain is performed again on the basis of the pre-trained language model by using a data set in a specific domain, so that the model learns prior information in the domain, and then fine-tuning training of the model is performed. As shown in FIG. 3, the second label-free dataset is trained using the pre-trained language model, and the pre-trained language model is continuously optimally adjusted using the loss function of the pre-trained language model. For example, in a complaint situation awareness scenario, the general knowledge learned by the model may be generally limited to only distinguish between "complaint" and "non-complaint" sentences, however, after intra-domain pre-training, the model may learn to distinguish between different complaint types in the complaint scenario, such as "hotel environmental hygiene complaint" and "hotel service complaint", thereby making subsequent clustering more accurate.
For stage 2: intention classification pre-training:
semi-supervised learning methods are generally more efficient than fully unsupervised methods, and therefore semi-supervised clustering of intentions with partially labeled data in combination with large amounts of unlabeled data is also used in the methods proposed by the present disclosure. As shown in fig. 4, firstly, labeling intra-domain data in a specific domain by combining methods such as active learning, and the like, to obtain a first labeled data set, then forming an initial classification model of an intention by combining an intermediate language model in the specific domain and the initial classification layer, training the first labeled data by using the initial classification model, performing fine-tuning operation on the initial classification model to obtain a first classification model, so that the first classification model learns intention category knowledge in the first labeled data, and feature vectors generated by the model have a certain degree of distinction on user labeling granularity, which is more beneficial to clustering. For example: "when the test results of the test of the sixth level in english can be found? "and" how to look up test results in the six-level test in english? "the sentence patterns are approximately the same, but the intentions are different, one is query time, the other is query mode, and the model can learn the intention difference through marking the marking information of the data, thereby improving the accuracy of clustering.
For stage 3: self-supervision clustering:
after the first classification model is obtained through training, the first classification model is trained again by using the first unlabeled data set, and the unlabeled data of the first labeled text data set and a large amount of unlabeled data (which may include at least part of the second unlabeled text data set) form the first unlabeled text data set for clustering. In the clustering process, the method provided by the disclosure combines self-supervision learning and alternative training, can generate pseudo labels from the label-free data set and learn by itself, and thus excavates supervision information in the data set. As shown in fig. 5-6, the method first clusters a first unlabeled text data set to form a plurality of clusters, then forms a plurality of clusters according to a threshold, forms different intention categories among different clusters, and performs self-supervised model training using the categories as pseudo labels of the text data in the clusters, specifically: after the second label-free text data set passes through the first classification model, a prediction intention type label is given to each text data, the value of the loss function of the first classification model is obtained by using the pseudo label and the prediction type label obtained by the method, the value of the loss function is used for optimizing the parameters of the first classification model, the updated first classification model generates new feature vector representation according to the first label-free text data set, and then a new pseudo label is generated in the clustering process, so that mutual promotion is continuously performed until the model converges, and the target classification model is obtained. Because the cluster type numbers clustered between the M round and the M +1 round may change, the Hungarian algorithm is adopted to align different clusters generated by iteration of the previous round and the next round, so that each cluster can have a unique pseudo label.
For stage 4: model clustering prediction:
after the model training is completed to obtain the target classification, the target language model in the model is used for carrying out feature extraction on the text data set to be clustered, clustering prediction is carried out on the text data set to be clustered according to the extracted feature vector set by utilizing a clustering algorithm, and N clustering clusters are formed, wherein each clustering cluster comprises at least one piece of text data, and the dotted line in fig. 3 shows that the characteristic vector set is used for carrying out clustering prediction on the text data set to be clustered. In order to enable the text data in each cluster to be as close to the intention category of the cluster as possible, a keyword filtering technology based on TF-IDF can be used for denoising the cluster, and the specific method is as follows: firstly performing word segmentation and word removal for each cluster, then calculating TF-IDF value of each word in the cluster by using TF-IDF, taking top-n words as candidate keywords of the cluster, using the candidate keywords to find out corresponding synonyms by using a synonym dictionary, adding the synonyms into the candidate keywords to form new candidate keywords, finally performing noise data filtration on text data in the cluster by using the candidate keywords of the cluster, and forming a final cluster after filtration.
The method of the present disclosure combines in-domain pre-training with keyword filtering on top of a semi-supervised clustering algorithm. It makes full use of domain knowledge in unlabeled in-domain data to enhance the feature representation of the input text data; it alternates training between the classification task and the clustering task to inject the prior knowledge of the labeled data into the feature vector representation and improve the clustering effect; following a self-supervision strategy, it combines the clustered pseudo labels with the predicted intention categories generated by the classification model, so that the classification and clustering tasks promote each other through continuous iteration and produce better clusters; and it finally adds a TF-IDF-based keyword filtering step to screen out the more obvious noise data in each cluster. This improves the accuracy of intention clustering, improves the efficiency of intention summarization and reduces labor cost; for example, when the method is applied to user intention recognition over spatio-temporal big data, the original accuracy can be improved by at least 8%.
As shown in fig. 7, the present disclosure also provides a classification model training apparatus 700 according to an embodiment of the present disclosure, the apparatus including:
a first obtaining module 701, configured to obtain a first unlabeled text data set;
a first training module 702, configured to input the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In one embodiment, the predicted intention category label is obtained by performing intention classification on the first unlabeled text data set with the first classification model; the reference intention category label is derived from the intention category labels of a plurality of clusters of the first unlabeled text data set; the clusters are obtained by clustering the first unlabeled text data set based on a feature vector set; and the feature vector set is obtained by performing feature extraction on the first unlabeled text data set with the first classification model.
In one embodiment, the apparatus 700 further comprises:
the second acquisition module is used for acquiring a second unlabeled text data set and a first labeled text data set;
the second training module is used for inputting the second unlabeled text data set into a pre-trained language model for training to obtain an intermediate language model;
and the third training module is used for constructing an initial classification model based on the intermediate language model and an initial classification layer and inputting the first labeled text data set into the initial classification model for training to obtain the first classification model, where the first classification model includes a first language model and a first classification layer connected in sequence.
In one embodiment, during the (M+1)-th round of iterative training, the intention category labels of the plurality of clusters of the first unlabeled text data set are category labels aligned with the M-th intention category labels;
where the M-th intention category labels are the intention category labels of the plurality of clusters of the first unlabeled text data set during the M-th round of iterative training, and M is a positive integer.
The classification model training device of each embodiment is a device for implementing the classification model training method of each embodiment, and has corresponding technical features and technical effects, which are not described herein again.
As shown in fig. 8, according to an embodiment of the present disclosure, the present disclosure also provides a clustering apparatus 800, the apparatus 800 including:
the feature extraction module 801 is configured to perform feature extraction on a text data set to be clustered by using a target language model in a target classification model to obtain a feature vector set of the text data set to be clustered;
the clustering module 802 is configured to perform intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, where N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
In one embodiment, the apparatus 800 further comprises:
the segmented-word set determining module is used for performing word segmentation on the text data in a target cluster and filtering out preset segmented words to obtain a target segmented-word set of the target cluster, where the target cluster is any one of the N clusters;
the calculation module is used for calculating the TF-IDF value of each segmented word in the target segmented-word set;
the keyword determining module is used for determining candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in the target segmented-word set;
and the filtering module is used for filtering the text data in the target cluster according to the candidate keywords.
In one embodiment, determining the candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in it includes:
determining n segmented words in the target segmented-word set as the candidate keywords, where n is a positive integer and the TF-IDF value of each of the n segmented words is greater than the TF-IDF values of the remaining segmented words, the remaining segmented words being those in the target segmented-word set other than the n segmented words.
In one embodiment, filtering the text data in the target cluster according to the candidate keywords includes:
filtering out target text data in the target cluster, where the number of target keywords in the target text data is less than a preset number and the target keywords belong to the candidate keywords.
In one embodiment, the target classification model is trained by:
acquiring a first unlabeled text data set;
inputting the first unlabeled text data set into a first classification model for iterative training to obtain the target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, the predicted intention category label and the reference intention category label being obtained based on the first classification model during the iterative training.
In one embodiment, the predicted intention category label is obtained by performing intention classification on the first unlabeled text data set with the first classification model; the reference intention category label is derived from the intention category labels of a plurality of clusters of the first unlabeled text data set; the clusters are obtained by clustering the first unlabeled text data set based on its feature vector set; and that feature vector set is obtained by performing feature extraction on the first unlabeled text data set with the first classification model.
The clustering device of each embodiment is a device for implementing the clustering method of each embodiment, and has corresponding technical features and technical effects, which are not described herein again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
A non-transitory computer readable storage medium of an embodiment of the present disclosure stores computer instructions for causing a computer to perform a classification model training method or a clustering method provided by the present disclosure.
The computer program product of the embodiments of the present disclosure includes a computer program for causing a computer to execute the classification model training method or the clustering method provided by the embodiments of the present disclosure.
FIG. 9 illustrates a schematic block diagram of an example electronic device 900 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a computing unit 901, which can execute various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902 and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller or microcontroller. The computing unit 901 performs the methods and processes described above, such as the classification model training method or the clustering method. For example, in some embodiments, the classification model training method or the clustering method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the classification model training method or the clustering method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the classification model training method or the clustering method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in the cloud computing service system that remedies the drawbacks of difficult management and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A classification model training method, the method comprising:
acquiring a first unlabeled text data set;
inputting the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, and the predicted intention category label and the reference intention category label are obtained based on the first classification model during the iterative training.
2. The method of claim 1, wherein the predicted intention category label is obtained by the first classification model performing intention classification on the first unlabeled text data set; the reference intention category label is obtained from intention category labels of a plurality of clusters of the first unlabeled text data set; the plurality of clusters are obtained by clustering the first unlabeled text data set based on a feature vector set; and the feature vector set is obtained by the first classification model performing feature extraction on the first unlabeled text data set.
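For illustration only (not part of the claimed subject matter): claims 1 and 2 can be read as a deep-clustering-style self-training loop, in which cluster assignments over the model's own feature vectors serve as the reference labels. Below is a minimal sketch assuming a PyTorch-style model with hypothetical encode and classify methods, and k-means as the clustering step; all names are illustrative.

```python
# Minimal sketch of the iterative training loop in claims 1-2.
# Assumes a model whose encoder yields feature vectors and whose
# classifier head yields intention logits; all names are illustrative.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def train_iteration(model, texts, optimizer, num_intents):
    # 1. Feature extraction: the first classification model embeds the
    #    unlabeled text data set into a feature vector set.
    with torch.no_grad():
        features = model.encode(texts)               # (N, hidden)
    # 2. Clustering: cluster the feature vectors; the cluster assignments
    #    serve as the reference intention category labels.
    reference = KMeans(n_clusters=num_intents).fit_predict(features.cpu().numpy())
    reference = torch.tensor(reference, dtype=torch.long)
    # 3. Prediction: the classifier head produces the predicted
    #    intention category labels (as logits).
    logits = model.classify(model.encode(texts))     # (N, num_intents)
    # 4. Loss: compare predictions with the cluster-derived reference
    #    labels and update the model.
    loss = F.cross_entropy(logits, reference)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```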
3. The method of claim 1, wherein, before the inputting the first unlabeled text data set into the first classification model for iterative training to obtain the target classification model, the method further comprises:
acquiring a second unlabeled text data set and a first labeled text data set;
inputting the second unlabeled text data set into a pre-trained language model for training to obtain an intermediate language model;
and constructing an initial classification model based on the intermediate language model and an initial classification layer, and inputting the first labeled text data set into the initial classification model for training to obtain the first classification model, wherein the first classification model comprises a first language model and a first classification layer that are sequentially connected.
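For illustration only: one way to realize claim 3's construction (an intermediate language model obtained by continued pre-training, topped with an initial classification layer) is sketched below. It assumes a BERT-style encoder whose output exposes last_hidden_state; the class and method names are hypothetical.

```python
# Sketch of claim 3's model construction: the intermediate language model
# is stacked with an initial classification layer, and the combined model
# is then fine-tuned on the first labeled text data set.
import torch.nn as nn

class FirstClassificationModel(nn.Module):
    def __init__(self, intermediate_lm, hidden_size, num_intents):
        super().__init__()
        self.language_model = intermediate_lm                   # first language model
        self.classifier = nn.Linear(hidden_size, num_intents)   # first classification layer

    def encode(self, input_ids, attention_mask=None):
        # Use the language model's pooled representation as the feature vector.
        out = self.language_model(input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]                      # [CLS]-style pooling

    def forward(self, input_ids, attention_mask=None):
        return self.classifier(self.encode(input_ids, attention_mask))
```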
4. The method of claim 2, wherein, during the (M+1)-th iterative training, the intention category labels of the plurality of clusters of the first unlabeled text data set are category labels aligned with the M-th intention category labels;
and the M-th intention category labels are the intention category labels of the plurality of clusters of the first unlabeled text data set during the M-th iterative training, M being a positive integer.
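For illustration only: claim 4 requires the cluster labels of the (M+1)-th round to be aligned with those of the M-th round, but does not name an alignment algorithm. Hungarian matching on the label co-occurrence matrix is one common choice for this kind of alignment; a sketch follows, with all names illustrative.

```python
# Align new cluster labels with the previous round's labels by finding
# the label permutation that maximizes co-occurrence (Hungarian matching).
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_cluster_labels(prev_labels, new_labels, num_intents):
    # cost[q, p]: how often new cluster q coincides with previous cluster p.
    cost = np.zeros((num_intents, num_intents), dtype=np.int64)
    for p, q in zip(prev_labels, new_labels):
        cost[q, p] += 1
    # Maximize overlap (minimize negative overlap) to obtain a permutation.
    rows, cols = linear_sum_assignment(-cost)
    mapping = dict(zip(rows, cols))
    # Relabel the new clusters in the previous round's label space.
    return np.array([mapping[q] for q in new_labels])
```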
5. A clustering method, wherein the method comprises:
performing feature extraction on a text data set to be clustered by using a target language model in a target classification model to obtain a feature vector set of the text data set to be clustered;
performing intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, wherein N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
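For illustration only: at clustering time (claim 5), only the language-model part of the trained target classification model is needed; its feature vectors are grouped into N intention clusters. A minimal sketch, reusing the hypothetical encode method from the sketches above:

```python
# Minimal sketch of claim 5: embed the texts to be clustered with the
# target language model, then partition the feature vectors into N clusters.
from sklearn.cluster import KMeans

def cluster_texts(target_model, texts, n_clusters):
    features = target_model.encode(texts)        # feature vector set (torch tensor assumed)
    return KMeans(n_clusters=n_clusters).fit_predict(features.cpu().numpy())
```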
6. The method according to claim 5, wherein after the intention clustering is performed on the text data set to be clustered based on the feature vector set to obtain the N clusters, the method further comprises:
performing word segmentation on text data in a target cluster and filtering out preset segmented words to obtain a target segmented-word set of the target cluster, wherein the target cluster is any one of the N clusters;
calculating a term frequency-inverse document frequency (TF-IDF) value of each segmented word in the target segmented-word set;
determining candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in the target segmented-word set;
and filtering the text data in the target cluster according to the candidate keywords.
7. The method of claim 6, wherein the determining candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in the target segmented-word set comprises:
determining n segmented words in the target segmented-word set as the candidate keywords, wherein n is a positive integer, the TF-IDF values of the n segmented words are greater than the TF-IDF values of the remaining segmented words, and the remaining segmented words are the segmented words in the target segmented-word set other than the n segmented words.
8. The method of claim 6, wherein the filtering the text data in the target cluster according to the candidate keywords comprises:
filtering out target text data in the target cluster, wherein the number of target keywords in the target text data is less than a preset number, and the target keywords belong to the candidate keywords.
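For illustration only: claims 6-8 describe cleaning each cluster by segmenting its texts, scoring the segmented words by TF-IDF, taking the n highest-scoring words as candidate keywords, and dropping texts that contain too few of them. The sketch below is one simple reading of that pipeline; the jieba segmenter is assumed for Chinese text, and the stop-word list, n_keywords, and min_hits threshold are illustrative.

```python
# One simple reading of claims 6-8: TF-IDF keyword mining inside a single
# cluster, followed by keyword-count filtering.
import math
from collections import Counter

import jieba  # assumed word segmenter

def clean_cluster(texts, stopwords, n_keywords=10, min_hits=1):
    # Claim 6: word segmentation plus filtering of preset segmented words.
    docs = [[w for w in jieba.lcut(t) if w not in stopwords] for t in texts]
    # Per-word TF-IDF over the documents of this cluster.
    df = Counter(w for doc in docs for w in set(doc))
    tfidf = Counter()
    for doc in docs:
        if not doc:
            continue
        for w, c in Counter(doc).items():
            tfidf[w] += (c / len(doc)) * math.log(len(docs) / (1 + df[w]))
    # Claim 7: the n highest-scoring segmented words become candidate keywords.
    keywords = {w for w, _ in tfidf.most_common(n_keywords)}
    # Claim 8: filter out texts containing fewer than a preset number
    # of candidate keywords.
    return [t for t, doc in zip(texts, docs)
            if sum(w in keywords for w in set(doc)) >= min_hits]
```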
9. The method of claim 5, wherein the target classification model is trained by:
acquiring a first unlabeled text data set;
inputting the first unlabeled text data set into a first classification model for iterative training to obtain the target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, and the predicted intention category label and the reference intention category label are obtained based on the first classification model during the iterative training.
10. The method of claim 9, wherein the predicted intention category label is obtained by the first classification model performing intention classification on the first unlabeled text data set; the reference intention category label is obtained from intention category labels of a plurality of clusters of the first unlabeled text data set; the plurality of clusters are obtained by clustering the first unlabeled text data set based on a feature vector set of the first unlabeled text data set; and the feature vector set of the first unlabeled text data set is obtained by the first classification model performing feature extraction on the first unlabeled text data set.
11. A classification model training apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a first unlabeled text data set;
the first training module is used for inputting the first unlabeled text data set into a first classification model for iterative training to obtain a target classification model;
wherein, during the iterative training, a loss value of the first classification model is calculated from a predicted intention category label and a reference intention category label of the first unlabeled text data set, and the predicted intention category label and the reference intention category label are obtained based on the first classification model during the iterative training.
12. The apparatus of claim 11, wherein the predicted intention category label is obtained by the first classification model performing intention classification on the first unlabeled text data set; the reference intention category label is obtained from intention category labels of a plurality of clusters of the first unlabeled text data set; the plurality of clusters are obtained by clustering the first unlabeled text data set based on a feature vector set; and the feature vector set is obtained by the first classification model performing feature extraction on the first unlabeled text data set.
13. A clustering apparatus, wherein the apparatus comprises:
the feature extraction module is used for extracting features of a text data set to be clustered by using a target language model in a target classification model to obtain a feature vector set of the text data set to be clustered;
the clustering module is used for performing intention clustering on the text data set to be clustered based on the feature vector set to obtain N clusters, wherein N is a positive integer;
wherein a loss value used in training the target classification model is calculated from a predicted intention category label and a reference intention category label of a first unlabeled text data set.
14. The apparatus of claim 13, further comprising:
the segmented-word set determining module is used for performing word segmentation on text data in a target cluster and filtering out preset segmented words to obtain a target segmented-word set of the target cluster, wherein the target cluster is any one of the N clusters;
the calculation module is used for calculating a term frequency-inverse document frequency (TF-IDF) value of each segmented word in the target segmented-word set;
the keyword determining module is used for determining candidate keywords of the target segmented-word set according to the TF-IDF values of the segmented words in the target segmented-word set;
and the filtering module is used for filtering the text data in the target cluster according to the candidate keywords.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the classification model training method of any one of claims 1-4.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the clustering method of any one of claims 5-10.
17. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the classification model training method of any one of claims 1 to 4 or the clustering method of any one of claims 5 to 10.
18. A computer program product comprising a computer program which, when executed by a processor, implements the classification model training method according to any one of claims 1 to 4 or the clustering method according to any one of claims 5 to 10.
CN202111150725.5A 2021-09-29 2021-09-29 Classification model training method, clustering method and electronic equipment Pending CN113918714A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111150725.5A CN113918714A (en) 2021-09-29 2021-09-29 Classification model training method, clustering method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111150725.5A CN113918714A (en) 2021-09-29 2021-09-29 Classification model training method, clustering method and electronic equipment

Publications (1)

Publication Number Publication Date
CN113918714A true CN113918714A (en) 2022-01-11

Family

ID=79236992

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111150725.5A Pending CN113918714A (en) 2021-09-29 2021-09-29 Classification model training method, clustering method and electronic equipment

Country Status (1)

Country Link
CN (1) CN113918714A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114090962A (en) * 2022-01-24 2022-02-25 湖北长江传媒数字出版有限公司 Intelligent publishing system and method based on big data
CN114090962B (en) * 2022-01-24 2022-05-13 湖北长江传媒数字出版有限公司 Intelligent publishing system and method based on big data
CN114637848A (en) * 2022-03-15 2022-06-17 美的集团(上海)有限公司 Semantic classification method and device
CN114860922A (en) * 2022-03-25 2022-08-05 南京脑科医院 Method for obtaining classification model of psychological assessment scale, screening method and system
CN114817530A (en) * 2022-04-02 2022-07-29 阿里巴巴(中国)有限公司 Data processing method and device
CN115310547A (en) * 2022-08-12 2022-11-08 中国电信股份有限公司 Model training method, article recognition method and device, electronic device and medium
CN115310547B (en) * 2022-08-12 2023-11-17 中国电信股份有限公司 Model training method, article identification method and device, electronic equipment and medium
CN116226382A (en) * 2023-02-28 2023-06-06 北京数美时代科技有限公司 Text classification method and device for given keywords, electronic equipment and medium
CN116226382B (en) * 2023-02-28 2023-08-01 北京数美时代科技有限公司 Text classification method and device for given keywords, electronic equipment and medium
CN116049412A (en) * 2023-03-31 2023-05-02 腾讯科技(深圳)有限公司 Text classification method, model training method, device and electronic equipment
CN116204626A (en) * 2023-05-05 2023-06-02 江西尚通科技发展有限公司 Dialogue new intention discovery method, system and computer based on deep learning
CN116204626B (en) * 2023-05-05 2023-07-14 江西尚通科技发展有限公司 Dialogue new intention discovery method, system and computer based on deep learning

Similar Documents

Publication Publication Date Title
CN113918714A (en) Classification model training method, clustering method and electronic equipment
US20220383190A1 (en) Method of training classification model, method of classifying sample, and device
CN112860866B (en) Semantic retrieval method, device, equipment and storage medium
US20180365209A1 (en) Artificial intelligence based method and apparatus for segmenting sentence
CN111967262A (en) Method and device for determining entity tag
CN112163405A (en) Question generation method and device
CN112148881A (en) Method and apparatus for outputting information
CN112507706A (en) Training method and device of knowledge pre-training model and electronic equipment
CN115688920A (en) Knowledge extraction method, model training method, device, equipment and medium
CN113010678A (en) Training method of classification model, text classification method and device
CN114444462B (en) Model training method and man-machine interaction method and device
CN114564971A (en) Deep learning model training method, text data processing method and text data processing device
CN112560425B (en) Template generation method and device, electronic equipment and storage medium
CN111538817A (en) Man-machine interaction method and device
CN112784050A (en) Method, device, equipment and medium for generating theme classification data set
CN116467461A (en) Data processing method, device, equipment and medium applied to power distribution network
US20230052623A1 (en) Word mining method and apparatus, electronic device and readable storage medium
CN113312451B (en) Text label determining method and device
CN115909376A (en) Text recognition method, text recognition model training device and storage medium
CN112860626B (en) Document ordering method and device and electronic equipment
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device
CN114647727A (en) Model training method, device and equipment applied to entity information recognition
CN113641724A (en) Knowledge tag mining method and device, electronic equipment and storage medium
CN113743112A (en) Keyword extraction method and device, electronic equipment and readable storage medium
CN113553833A (en) Text error correction method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination