CN110297909B - Method and device for classifying unlabeled corpora


Publication number
CN110297909B
Authority
CN
China
Prior art keywords
training
sample
corpus
model
unlabeled
Prior art date
Legal status
Active
Application number
CN201910602361.6A
Other languages
Chinese (zh)
Other versions
CN110297909A (en)
Inventor
刘华杰
李晓萍
张宏韬
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN201910602361.6A priority Critical patent/CN110297909B/en
Publication of CN110297909A publication Critical patent/CN110297909A/en
Application granted granted Critical
Publication of CN110297909B publication Critical patent/CN110297909B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for classifying unlabeled corpora. The method comprises the following steps: obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question; inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer. The device is used for executing the method. The method and the device for classifying unlabeled corpora provided by the embodiments of the invention improve the accuracy of classifying unlabeled corpora.

Description

Method and device for classifying unlabeled corpora
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for classifying unlabeled corpora.
Background
With the development of artificial intelligence technology, dialogue robots based on artificial intelligence are widely applied in many fields such as customer service, outbound calling, sales, and intelligent search. Intention recognition, a core technology in a dialogue robot system, directly determines the accuracy of the dialogue and the user experience.
At present, the most effective intention recognition technique is the deep learning model: a trained deep learning model can classify unlabeled corpora, which facilitates intention recognition. However, training a deep learning model requires collecting a large number of labeled samples, which is time-consuming and labor-intensive; accumulating a large amount of labeled data (i.e., sample data) takes a long time, and high-quality labeled data in large quantities is very expensive. In addition, a deep learning model has so many parameters that it easily overfits when sample data is scarce, and it is very sensitive to noisy data. To solve the overfitting problem caused by scarce sample data, the prior art selects simple models and applies data-processing techniques such as denoising and sample expansion, but the problem that too little sample data yields an insufficiently accurate deep learning model remains difficult to solve, so the classification accuracy on unlabeled corpora is very low, which limits the application of deep learning models.
Disclosure of Invention
To solve the problems in the prior art, embodiments of the present invention provide a method and an apparatus for classifying unlabeled corpora, which can at least partially solve those problems.
In one aspect, the present invention provides a method for classifying unlabeled corpora, comprising:
obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question;
inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer.
In another aspect, the present invention provides a device for classifying unlabeled corpora, comprising:
an obtaining unit, configured to obtain an unlabeled corpus, wherein the unlabeled corpus comprises at least one question;
a classification unit, configured to input each question included in the unlabeled corpus into the text classification model and output a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer.
In still another aspect, the present invention provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method for classifying unlabeled corpora according to any of the above embodiments.
The method and the device for classifying unlabeled corpora provided by the embodiments of the present invention obtain an unlabeled corpus, input each question included in the unlabeled corpus into a text classification model trained on unlabeled corpus samples, and output the label corresponding to each question. Even when high-quality labeled samples are difficult to obtain or insufficient in quantity, the unlabeled corpus can be classified by the text classification model trained on unlabeled corpus samples, which improves the accuracy of classifying unlabeled corpora.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a schematic flowchart of a method for classifying unlabeled corpora according to an embodiment of the present invention.
Fig. 2 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 3 is a schematic structural diagram of a first training model according to an embodiment of the present invention.
Fig. 4 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 5 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 6 is a schematic structural diagram of a Seq2Seq model according to an embodiment of the present invention.
Fig. 7 is a schematic structural diagram of a system for building a text classification model based on unlabeled corpora according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of a device for classifying unlabeled corpora according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 11 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention.
Fig. 12 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Fig. 1 is a schematic flowchart of a method for classifying unlabeled corpora according to an embodiment of the present invention. As shown in fig. 1, the method includes:
S101, obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question;
Specifically, whether a customer is served by a human agent or by a dialogue robot, voice dialogues are generated in the process of serving the customer. These voice dialogues can be converted into text through speech recognition technology, and the questions in the text can be collected as an unlabeled corpus. The unlabeled corpus is text data that has not yet been classified, and it comprises at least one question.
S102, inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained after training based on unlabeled corpus samples, and each corpus data in the unlabeled corpus samples comprises a question and an answer.
Specifically, after the unlabeled corpus is obtained, each question in the unlabeled corpus is used as an input of the text classification model, and through the processing of the text classification model, a label corresponding to each question can be output, where the label identifies the type to which the question belongs. The text classification model is obtained by training on unlabeled corpus samples; each piece of corpus data in the unlabeled corpus samples comprises a question and an answer, where the answer is a solution to the question and can also be regarded as a kind of labeling of the question. The number of pieces of corpus data included in the unlabeled corpus samples is set according to actual needs and is not limited in the embodiments of the present invention. The specific training process of the text classification model is described below and is not repeated here. The execution subject of the embodiments of the present invention includes, but is not limited to, a computer.
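As an illustration only, the following Python sketch shows how such a trained model could be applied at inference time; the `model`, `tokenizer`, and `id2label` objects are assumptions standing in for the trained text classification model and its vocabulary, not interfaces defined by this embodiment.

```python
import torch

def classify_questions(model, tokenizer, questions, id2label):
    """Feed each question of the unlabeled corpus to the trained text
    classification model and collect the predicted label per question."""
    model.eval()
    labels = []
    with torch.no_grad():
        for question in questions:
            token_ids = torch.tensor([tokenizer(question)])  # shape: (1, seq_len)
            logits = model(token_ids)                        # shape: (1, num_labels)
            labels.append(id2label[logits.argmax(dim=-1).item()])
    return labels
```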
The method for classifying unlabeled corpora provided by the embodiment of the present invention obtains an unlabeled corpus, inputs each question included in the unlabeled corpus into a text classification model trained on unlabeled corpus samples, and outputs the label corresponding to each question. When high-quality labeled samples are difficult to obtain or insufficient in quantity, the unlabeled corpus can still be classified by the text classification model trained on unlabeled corpus samples, which improves the accuracy of unlabeled corpus classification.
Fig. 2 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention. As shown in fig. 2, obtaining the text classification model by training on unlabeled corpus samples includes the following steps:
s201, training a model of a coding-decoding framework based on the unlabeled corpus sample to obtain a pre-training model, wherein the pre-training model comprises a coding layer;
specifically, the unlabeled corpus sample is input into a model of an encoding-decoding framework, and the model of the encoding-decoding framework is trained to obtain a pre-trained model, wherein the pre-trained model comprises an encoding layer. Encoding, namely converting an input sequence into a vector with a fixed length; and decoding, namely converting the previously generated vector with fixed length into an output sequence, wherein in the training process of the model of the coding-decoding framework, the question included in each piece of linguistic data in the unlabeled linguistic data sample is correspondingly coded, and the answer included in each piece of linguistic data is correspondingly decoded, so that semantic codes with better representation and generalization capability can be obtained. The Encode-Decoder framework is a model framework in deep learning, and the model of the Encode-Decoder framework includes, but is not limited to, a Sequence to Sequence (Seq 2Seq) model.
For example, in order to obtain the unlabeled corpus sample, stored customer service recordings may be transcribed offline into text by Automatic Speech Recognition (ASR) technology to obtain an original corpus; the dialogue scenes in the original corpus are then manually proofread to obtain the unlabeled corpus sample, where the proofreading includes, but is not limited to, error correction and sentence alignment, and each piece of corpus data in the resulting unlabeled corpus sample includes a question and an answer. The corpus data is, for example: Question: Hello, may I ask what the bank's mortgage interest rate is? Answer: Hello, the current interest rate is 5.6%. Or: Question: How much is my credit card limit at present? Answer: Hello, your current limit is 50,000 RMB.
S202, sampling the unlabeled corpus samples to obtain first training samples, wherein each piece of training sample data in the first training samples comprises a question and a corresponding label;
specifically, after obtaining the unlabeled corpus sample, sampling the unlabeled corpus sample to obtain a first training sample, where each piece of training sample data in the first training sample includes a question and a label corresponding to the question. The tag is used to identify the type to which the question belongs, and is preset. Each piece of training sample data corresponds to the corpus data in one unlabeled corpus sample, and the problems included in the training sample data are the same as the problems included in the corresponding corpus data. It is understood that step S202 and step S201 have no precedence relationship.
For example, first cluster the unlabeled corpus samples to obtain unlabeled corpus samples grouped into preset categories; then sample each category of unlabeled corpus samples to obtain an original sample, and manually label the original sample, where labeling means classifying each piece of corpus data in the original sample to obtain the label corresponding to each piece of corpus data; finally, obtain the first training sample from the question included in each piece of corpus data of the labeled original sample and the label corresponding to each piece of corpus data, as sketched below. The preset categories, as well as the sampling ratio or number of samples per category of unlabeled corpus samples, are set according to actual needs and are not limited in the embodiments of the present invention.
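A minimal sketch of this sampling step, assuming TF-IDF features and K-means clustering (the embodiment equally allows LDA); the cluster count and per-cluster quota are illustrative parameters, and the returned original sample would still be labeled manually.

```python
import random
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def sample_for_labeling(questions, n_clusters=20, per_cluster=50, seed=0):
    """Cluster the unlabeled questions into preset categories and draw a
    fixed number from each cluster to form the original sample."""
    vectors = TfidfVectorizer().fit_transform(questions)
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(vectors)
    rng = random.Random(seed)
    original_sample = []
    for c in range(n_clusters):
        members = [q for q, cid in zip(questions, cluster_ids) if cid == c]
        original_sample.extend(rng.sample(members, min(per_cluster, len(members))))
    return original_sample  # labeled manually into (question, label) pairs
```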
S203, training a first training model based on the first training sample to obtain an initial classification model; the first training model comprises a classification layer and a coding layer of the pre-training model, and parameters of the coding layer are kept unchanged in the training process of the first training model;
specifically, a first training model is established, the first training model comprises an encoding (Encoder) layer and a classification layer of the pre-training model, the output of the Encoder layer is used as the input of the classification layer, and the classification layer can adopt a Softmax algorithm. Taking the problem included in each piece of training sample data of the first training sample as the input of the first training model, taking the label included in each piece of training sample data of the first training sample as the output of the first training model, keeping the parameters of the Encoder layer unchanged, and training the first training model to obtain an initial classification model, wherein the initial classification model comprises the Encoder layer. In the training process of the first training model, parameters of the Encoder layer are kept unchanged, and the parameters of the classification layer are obtained through training, so that the relation between unlabeled corpus samples is utilized, the supervised training classification layer can be provided, a text classification task is completed, and an initial classification model with high generalization can be obtained through training with fewer samples. Because the first training model adopts the coding layer of the pre-training model, the number of training sample data in the first training sample required in model training can be reduced, the cost and time for obtaining the first training sample are saved, the training efficiency of the first training model is improved, and the obtaining efficiency of the text classification model is also improved.
For example, fig. 3 is a schematic structural diagram of a first training model provided in an embodiment of the present invention. As shown in fig. 3, the first training model includes a Word embedding layer, an Encoder layer, and a Softmax layer. The question included in a piece of training sample data of the first training sample is, for example: "Hello, may I ask if you are XXX?". The question is input into the Word embedding layer, which converts each word in the question into a fixed-length word vector and outputs it to the Encoder layer; the Encoder layer processes the input word vectors and outputs a state variable C to the Softmax layer as its initial value. After training, the Softmax layer outputs the label corresponding to the question; for the example question "Hello, may I ask if you are XXX?", it outputs a label indicating a question that asks whether the callee is the person concerned.
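The structure in fig. 3 could be sketched as follows, under the assumption that the pre-trained Encoder layer is an LSTM; only the classification layer's parameters receive gradients, matching the requirement that the Encoder parameters stay unchanged.

```python
import torch.nn as nn

class FirstTrainingModel(nn.Module):
    """Word embedding layer + frozen pre-trained Encoder layer + Softmax layer."""
    def __init__(self, pretrained_embedding, pretrained_encoder, hidden_size, num_labels):
        super().__init__()
        self.embedding = pretrained_embedding   # Word embedding layer
        self.encoder = pretrained_encoder       # Encoder layer of the pre-training model
        for param in self.encoder.parameters():
            param.requires_grad = False         # Encoder parameters kept unchanged
        self.classifier = nn.Linear(hidden_size, num_labels)  # Softmax (classification) layer

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.encoder(embedded)  # final hidden state plays the role of state C
        return self.classifier(hidden[-1])       # logits; softmax is applied inside the loss
```

Training would then optimize only `self.classifier` with a cross-entropy loss over the labels of the first training sample.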
S204, obtaining a supplementary training sample based on the remaining unlabeled corpus samples and the initial classification model; wherein the remaining unlabeled corpus samples are obtained by removing the corpus data corresponding to the first training sample from the unlabeled corpus samples;
specifically, after the initial classification model is obtained, the initial classification model may be used to label the remaining unlabeled corpus samples, so as to obtain labels corresponding to each corpus data of the remaining unlabeled corpus samples, that is, the question included in each corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, so as to obtain the labels corresponding to each corpus data of the remaining unlabeled corpus samples. Then, according to the question and the corresponding label included in each piece of corpus data of the remaining unlabeled corpus samples, a supplementary training sample can be obtained, wherein each piece of training sample of the supplementary training sample includes a question and a corresponding label. The remaining unlabeled corpus samples are obtained by removing the unlabeled corpus sample corresponding to the first training sample from the unlabeled corpus samples, and each piece of training sample data of the first training sample corresponds to one piece of corpus data of the unlabeled corpus sample.
S205, training the initial classification model based on a second training sample to obtain the text classification model; wherein the second training samples comprise the first training samples and the supplementary training samples, and parameters of an encoding layer included in the initial classification model are kept unchanged in the training process of the initial classification model.
Specifically, the supplementary training sample and the first training sample are combined to form a second training sample, each piece of training sample data of which includes a question and a corresponding label. The question included in each piece of training sample data of the second training sample is used as the input of the initial classification model, the label included in each piece of training sample data is used as its output, the parameters of the Encoder layer of the initial classification model are kept unchanged during training, and the initial classification model is trained to obtain the text classification model.
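Continuing the sketch, the second training sample is simply the union of the two sets, and the loop again updates only the classification layer; the `tokenizer` and the `(question, label_id)` sample format are the same assumptions as above.

```python
import torch
import torch.nn as nn

def train_on_second_sample(model, tokenizer, first_sample, supplementary_sample, epochs=5):
    """Both samples are lists of (question, label_id) pairs; the Encoder
    stays frozen because its parameters have requires_grad=False."""
    second_sample = first_sample + supplementary_sample
    optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for question, label_id in second_sample:
            optimizer.zero_grad()
            logits = model(torch.tensor([tokenizer(question)]))
            loss = loss_fn(logits, torch.tensor([label_id]))
            loss.backward()
            optimizer.step()
    return model
```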
The method for classifying unlabeled corpora provided by the embodiment of the present invention can establish the text classification model without a large number of labeled training samples, which improves the accuracy of the established text classification model. Because unlabeled corpus samples are very convenient to obtain, the method can still establish a text classification model when high-quality labeled samples are difficult to obtain or insufficient in quantity, thereby avoiding the overfitting problem.
Fig. 4 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention. As shown in fig. 4, on the basis of the foregoing embodiments, sampling the unlabeled corpus sample to obtain the first training sample further includes:
s2021, clustering the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories;
specifically, a clustering algorithm may be used to cluster the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories, that is, each preset category corresponds to a certain amount of corpus data of the unlabeled corpus samples, and each corpus data of the unlabeled corpus samples corresponds to one preset category. The preset category is set according to actual experience, and the embodiment of the invention is not limited; the clustering algorithm can adopt LDA (latent Dirichlet allocation) or K-means clustering algorithm. It can be understood that the preset categories are the number of categories in clustering, and the label corresponding to each preset category is not known.
S2022, sampling the unlabeled corpus sample of each preset category to obtain an original sample;
specifically, in order to reduce the amount of corpus data of the unlabeled corpus sample to be labeled, the unlabeled corpus sample of each preset category is sampled, that is, a certain amount of corpus samples are obtained from the unlabeled corpus sample of each preset category, so as to obtain an original sample. The proportion or the number of the samples is set according to actual needs, and the embodiment of the invention is not limited.
S2023, obtaining the first training sample according to the marked original sample.
Specifically, after the original sample is obtained, each piece of corpus data of the original sample is manually assigned to a preset category according to its semantics, which yields the label corresponding to each piece of corpus data and thus a labeled original sample. The first training sample is obtained from the question included in each piece of corpus data of the labeled original sample and the label corresponding to each piece of corpus data, where each piece of training sample data of the first training sample includes a question and a corresponding label.
Fig. 5 is a schematic flowchart of a method for classifying unlabeled corpora according to another embodiment of the present invention. As shown in fig. 5, on the basis of the foregoing embodiments, obtaining the supplementary training sample based on the remaining unlabeled corpus samples and the initial classification model further includes:
s2041, labeling the remaining unlabeled corpus samples through the initial classification model to obtain labels corresponding to each corpus data in the remaining unlabeled corpus samples;
specifically, after the initial classification model is obtained, the question included in each piece of corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, so that the label corresponding to each piece of corpus data of the remaining unlabeled corpus samples can be obtained.
S2042, obtaining the supplementary training sample according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and the question included in each piece of corpus data.
Specifically, a supplementary training sample can be obtained according to a question included in each piece of corpus data of the remaining unlabeled corpus sample and a label corresponding to each piece of corpus data, where each piece of training sample data of the supplementary training sample includes a question and a corresponding label, each piece of corpus data of the remaining unlabeled corpus sample corresponds to one piece of training sample data of the supplementary training sample, and the corpus data of the remaining unlabeled corpus sample and the corresponding training sample data include the same question.
On the basis of the above embodiments, further, the coding-decoding framework-based model adopts a sequence-to-sequence model.
In particular, the model based on the encoding-decoding framework adopts a sequence-to-sequence model; the Seq2Seq model is suitable for machine translation, text summarization, dialogue, and similar scenarios. The embodiment of the present invention trains a Seq2Seq model to obtain the pre-training model, which can better express the relation between sentences, and the Encoder layer of the pre-training model can better represent a sentence. The Seq2Seq model may use the Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) algorithm to implement the Encoder and Decoder layers, or the Transformer algorithm to implement the Encoder-Decoder layers. The GRU algorithm trains faster than the LSTM algorithm, while the Transformer algorithm achieves a better model effect thanks to its attention mechanism.
The Seq2Seq model implemented with LSTM is described below. Fig. 6 is a schematic structural diagram of a Seq2Seq model according to an embodiment of the present invention. As shown in fig. 6, the Seq2Seq model includes two Word Embedding layers, an Encoder layer, and a Decoder layer. A Word Embedding layer converts each word of an input sentence into a fixed-length word vector. The Encoder layer and the Decoder layer are both implemented with LSTM. The question included in each piece of corpus data of the unlabeled corpus sample is input into Word Embedding layer 1; for example, for the question "Hello, may I ask if you are XXX?", Word Embedding layer 1 converts each word in the question into a fixed-length word vector and outputs it to the Encoder layer, and the Encoder layer processes the input word vectors and outputs a state variable C to the Decoder layer as the Decoder layer's initial value. The answer included in each piece of corpus data is input into Word Embedding layer 2; for example, for the answer "Yes, I am.", Word Embedding layer 2 converts each word in the answer into a fixed-length word vector and outputs it to the Decoder layer. Through learning from the state variable C, the Decoder layer finally outputs the answer corresponding to the question. GO and EOS are special characters: GO marks the start of the answer, and EOS marks its end.
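A minimal PyTorch sketch of the structure in fig. 6, with illustrative dimensions: two Word Embedding layers feed an LSTM Encoder and an LSTM Decoder, and the Encoder's final state (the state variable C) initializes the Decoder.

```python
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, question_vocab, answer_vocab, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.q_embedding = nn.Embedding(question_vocab, embed_dim)  # Word Embedding layer 1
        self.a_embedding = nn.Embedding(answer_vocab, embed_dim)    # Word Embedding layer 2
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, answer_vocab)

    def forward(self, question_ids, answer_ids):
        # answer_ids starts with GO; the training target ends with EOS
        _, state_c = self.encoder(self.q_embedding(question_ids))   # state variable C
        decoded, _ = self.decoder(self.a_embedding(answer_ids), state_c)
        return self.output(decoded)  # next-token logits over the answer vocabulary
```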
Fig. 7 is a schematic structural diagram of a system for building a text classification model based on unlabeled corpora according to an embodiment of the present invention. As shown in fig. 7, the system may be used to build the text classification model. The system comprises: a dialogue corpus generating apparatus 1, a sample generating apparatus 2, a model training apparatus 3, a sample storage apparatus 4, and a model storage apparatus 5, wherein:
the dialogue corpus generating device 1 is used for collecting and generating unlabeled corpus samples. The dialogue corpus generating apparatus 1 includes an offline speech transcription unit 101 and a collation unit 102. The offline voice transcription unit 101 is configured to transcribe the dialogue recording corpus into a text, and the proofreading unit 102 is configured to proofread the transcribed text and generate the unlabeled corpus sample. The collation unit 102 may send the generated unlabeled corpus sample to the sample storage 4 for storage. Wherein the proofreading includes but is not limited to dialog alignment, manual error correction, and the like. The unlabeled dialogue sample obtained by the dialogue corpus generating apparatus 1 is exemplified as follows:
the problems are as follows: how much do you like asking the bank to ask for room credit? The answer is: your best, the current interest rate is xx.
The problems are as follows: how much does my credit card amount be present? The answer is: you good, your current quota is xxx.
The sample generating apparatus 2 is configured to generate a first training sample from the unlabeled corpus sample. The sample generating apparatus 2 includes a sample clustering unit 201, a sample sampling unit 202, and a sample labeling unit 203. The sample clustering unit 201 is configured to cluster the unlabeled corpus samples with a clustering algorithm into preset categories; it may obtain the unlabeled corpus sample from the sample storage apparatus 4, and the clustering algorithm may be the LDA or K-means algorithm. The sample sampling unit 202 is configured to sample the unlabeled corpus sample of each preset category to obtain an original sample. The sample labeling unit 203 is configured to obtain the first training sample from the labeled original sample, the original sample being labeled by manually assigning a label to each preset category. The sample labeling unit 203 may send the first training sample to the sample storage apparatus 4 for storage, in preparation for the subsequent initial model training. Examples of the first training sample are as follows:
label_query_rate_service Hello, what is the bank's mortgage interest rate?
label_query_limit_service How much is my credit card limit now?
label_transact_credit_card_service I want to apply for a credit card.
Here each line starts with the label: the labels label_query_rate_service, label_query_limit_service, and label_transact_credit_card_service are set according to practical experience, and the sentence after each label is the question corresponding to that label.
The model training apparatus 3 is used to train the corresponding models from the training data. The model training apparatus 3 includes a pre-training model unit 301, an initial model training unit 302, and a final model training unit 303. The pre-training model unit 301 may obtain the unlabeled corpus sample from the sample storage apparatus 4 and is configured to perform model training with an algorithm of the Encoder-Decoder framework on the unlabeled corpus sample to obtain a pre-training model. The algorithm of the Encoder-Decoder framework can be the Seq2Seq algorithm, which is commonly used in scenarios such as machine translation, text summarization, and dialogue. In the embodiment of the present invention, the Seq2Seq algorithm is used as the pre-training algorithm; the pre-training model obtained by training can better express the relation between sentences, and its Encoder layer can better represent a sentence.
The initial model training unit 302 is configured to train, from the first training sample, a first training model comprising the Encoder layer of the pre-training model and a classification layer, so as to obtain an initial classification model. The Encoder layer parameters are fixed during training and only the parameters of the classification layer are trained; in this way, the relations among the pieces of corpus data of the unlabeled corpus sample are exploited to train a classifier in a supervised manner and complete the text classification task, and an initial classification model with high generalization can be obtained from a small number of samples. After the initial classification model is obtained, the remaining unlabeled corpus samples are labeled with the initial classification model to obtain a supplementary training sample, and the first training sample and the supplementary training sample are combined into a second training sample containing more training data than the first training sample. The final model training unit 303 is configured to train the initial classification model on the second training sample to obtain the text classification model; the parameters of the classification layer of the initial classification model are trained, while its Encoder layer parameters remain fixed.
The sample storage device 4 is used for storing sample data, and the sample storage device 4 includes an unlabeled sample storage unit 401 and a sample data storage unit 402. The unlabeled sample storage unit 401 is configured to store the unlabeled corpus sample. The sample data storage unit 402 is configured to store the first training sample and the second training sample.
The model storage means 5 is used to store models. The model storage means 5 includes a pre-training model storage unit 501, an initial classification model storage unit 502, and a final model storage unit 503. The pre-training model storage unit 501 is configured to store the pre-training model, the initial classification model storage unit 502 is configured to store the initial classification model, and the final model storage unit 503 is configured to store the text classification model.
After the text classification model is put into practice, more corpus data accumulates over time, so the unlabeled corpus sample, the first training sample, and the second training sample can all be expanded. The model of the encoding-decoding framework can be retrained with the expanded unlabeled corpus sample to update the pre-training model, and once the pre-training model is updated, the first training model can be updated accordingly; the updated first training model can be retrained with the expanded first training sample to update the initial classification model; and the expanded second training sample can be used to train the updated initial classification model and update the text classification model. A mechanism of continuous iterative updating is thus established, by which the text classification model can be continuously optimized and updated, further improving its accuracy.
Fig. 8 is a schematic structural diagram of a device for classifying unlabeled corpora according to an embodiment of the present invention, and as shown in fig. 8, the device for classifying unlabeled corpora according to an embodiment of the present invention includes an obtaining unit 801 and a classifying unit 802, where:
the obtaining unit 801 is configured to obtain a non-tag corpus, where the non-tag corpus includes at least one question; the classification unit 802 is configured to input each question included in the unlabeled corpus into the text classification model, and output a label corresponding to each question; the text classification model is obtained after training based on unlabeled corpus samples, and each corpus data in the unlabeled corpus samples comprises a question and an answer.
Specifically, no matter the service is manual, or the robot is a dialogue robot, a voice dialogue is generated in the process of serving a customer, the voice dialogue can be converted into a text through a voice recognition technology, problems in the text can be collected to be used as a non-labeled corpus, the non-labeled corpus is text data which is not classified, and the obtaining unit 801 can obtain the non-labeled corpus, where the non-labeled corpus includes at least one problem.
After the unlabeled corpus is obtained, the classification unit 802 uses each question in the unlabeled corpus as an input of the text classification model, and through the processing of the text classification model, a label corresponding to each question can be output, where the label identifies the type to which the question belongs. The text classification model is obtained by training on unlabeled corpus samples; each piece of corpus data in the unlabeled corpus samples comprises a question and an answer, where the answer is a solution to the question and can also be regarded as a kind of labeling of the question. The number of pieces of corpus data included in the unlabeled corpus samples is set according to actual needs and is not limited in the embodiments of the present invention. The specific training process of the text classification model is described below and is not repeated here.
The classification device for unlabeled corpora provided by the embodiment of the present invention obtains an unlabeled corpus, inputs each question included in the unlabeled corpus into a text classification model trained on unlabeled corpus samples, and outputs the label corresponding to each question. When high-quality labeled samples are difficult to obtain or insufficient in quantity, the unlabeled corpus can still be classified by the text classification model trained on unlabeled corpus samples, which improves the accuracy of unlabeled corpus classification.
Fig. 9 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention, and as shown in fig. 9, the device for classifying unlabeled corpora according to an embodiment of the present invention further includes a pre-training unit 803, a sample obtaining unit 804, an initial training unit 805, a sample supplementing unit 806, and a model building unit 807, where:
the pre-training unit 803 is configured to train a model of the encoding-decoding framework based on the unlabeled corpus sample to obtain a pre-training model, where the pre-training model includes an encoding layer; the sample obtaining unit 804 is configured to sample the unlabeled corpus sample to obtain a first training sample, where each piece of training sample data in the first training sample includes a question and a corresponding label; the initial training unit 805 is configured to train a first training model based on the first training sample to obtain an initial classification model; the first training model comprises a classification layer and a coding layer of the pre-training model, and parameters of the coding layer are kept unchanged in the training process of the first training model; the sample supplementing unit 806 is configured to obtain a supplemented training sample based on the remaining unlabeled corpus samples and the initial classification model; wherein the remaining unlabeled corpus samples are obtained by removing the unlabeled corpus sample corresponding to the first training sample from the unlabeled corpus sample; the model establishing unit 807 is configured to train the initial classification model based on a second training sample to obtain the text classification model; wherein the second training samples comprise the first training samples and the supplementary training samples, and parameters of an encoding layer of the initial classification model are kept unchanged in the training process of the initial classification model.
Specifically, the pre-training unit 803 inputs the unlabeled corpus sample into a model of the encoding-decoding framework and trains the model to obtain a pre-training model, where the pre-training model includes an encoding layer. Encoding converts an input sequence into a fixed-length vector; decoding converts the previously generated fixed-length vector into an output sequence. In the training of the encoding-decoding framework model, the question included in each piece of corpus data in the unlabeled corpus sample is handled by the encoding side and the answer included in each piece of corpus data by the decoding side, so that semantic codes with better representation and generalization capability can be obtained. The Encoder-Decoder framework is a model framework in deep learning, and models of this framework include, but are not limited to, the sequence-to-sequence model.
After the unlabeled corpus sample is obtained, the sample obtaining unit 804 samples it to obtain a first training sample, where each piece of training sample data in the first training sample includes a question and the label corresponding to that question. The label identifies the type to which the question belongs and is preset. Each piece of training sample data corresponds to one piece of corpus data in the unlabeled corpus sample, and the question included in the training sample data is the same as the question included in the corresponding corpus data.
Specifically, the initial training unit 805 establishes a first training model comprising the encoding layer of the pre-training model and a classification layer; the output of the Encoder layer serves as the input of the classification layer, and the classification layer may adopt the Softmax algorithm. The initial training unit 805 takes the question included in each piece of training sample data of the first training sample as the input of the first training model and the label included in each piece of training sample data as its output, keeps the parameters of the Encoder layer unchanged, and trains the first training model to obtain an initial classification model that includes the Encoder layer. During the training of the first training model, the parameters of the Encoder layer stay fixed and only the parameters of the classification layer are learned; in this way, the relations among the unlabeled corpus samples are exploited to train the classification layer in a supervised manner and complete the text classification task, and an initial classification model with high generalization can be trained from relatively few samples.
After obtaining the initial classification model, the sample supplementing unit 806 may label the remaining unlabeled corpus samples with the initial classification model, obtaining a label for each piece of corpus data in the remaining unlabeled corpus samples; that is, the question included in each piece of corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, and the corresponding label is obtained. The sample supplementing unit 806 may then build a supplementary training sample from the question and the corresponding label of each piece of corpus data of the remaining unlabeled corpus samples, where each piece of training sample data of the supplementary training sample includes a question and a corresponding label. The remaining unlabeled corpus samples are obtained by removing from the unlabeled corpus samples the corpus data corresponding to the first training sample, each piece of training sample data of the first training sample corresponding to one piece of corpus data of the unlabeled corpus sample.
The model building unit 807 combines the supplementary training samples and the first training sample as a second training sample, each piece of training sample of the second training sample including a question and a corresponding label. The model establishing unit 807 takes the problem included in each piece of training sample data of the second training sample as the input of the initial classification model, takes the label included in each piece of training sample data of the second training sample as the output of the initial classification model, keeps the parameter of the Encoder layer of the initial classification model unchanged in the training process, trains the initial classification model, and can obtain a text classification model.
The classification device for unlabeled corpora provided by the embodiment of the present invention can establish the text classification model without a large number of labeled training samples, which improves the accuracy of the established text classification model. Because unlabeled corpus samples are very convenient to obtain, the device can still establish a text classification model when high-quality labeled samples are difficult to obtain or insufficient in quantity, thereby avoiding the overfitting problem.
Fig. 10 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention, and as shown in fig. 10, the sample obtaining unit 804 includes a clustering subunit 8041, a sampling subunit 8042, and an obtaining subunit 8043, where:
the clustering subunit 8041 is configured to cluster the unlabeled corpus samples to obtain unlabeled corpus samples of a preset category; the sampling subunit 8042 is configured to sample the unlabeled corpus sample of each preset category to obtain an original sample; the obtaining subunit 8043 is configured to obtain the first training sample according to the labeled original sample.
Specifically, the clustering subunit 8041 may cluster the unlabeled corpus samples with a clustering algorithm into preset categories, so that each preset category corresponds to a certain amount of corpus data of the unlabeled corpus samples and each piece of corpus data corresponds to one preset category. The preset categories are set according to practical experience and are not limited in the embodiments of the present invention; the clustering algorithm can be LDA (Latent Dirichlet Allocation) or the K-means clustering algorithm. It can be understood that the preset categories only fix the number of clusters, and the label corresponding to each preset category is not yet known.
In order to reduce the amount of corpus data of the unlabeled corpus sample to be labeled, the sampling subunit 8042 samples the unlabeled corpus sample of each preset category, that is, obtains a certain amount of corpus samples from the unlabeled corpus sample of each preset category, and obtains an original sample. The proportion or the number of the samples is set according to actual needs, and the embodiment of the invention is not limited.
After the original sample is obtained, each piece of corpus data of the original sample is manually assigned to a preset category according to its semantics, which yields the label corresponding to each piece of corpus data and thus a labeled original sample. The obtaining subunit 8043 obtains the question included in each piece of corpus data of the labeled original sample and the label corresponding to each piece of corpus data to obtain the first training sample, where each piece of training sample data of the first training sample includes a question and a corresponding label.
Fig. 11 is a schematic structural diagram of a device for classifying unlabeled corpora according to another embodiment of the present invention. As shown in fig. 11, on the basis of the foregoing embodiments, the sample supplementing unit 806 further includes a labeling subunit 8061 and a supplementing subunit 8062, where:
the labeling subunit 8061 is configured to label the remaining unlabeled corpus samples through the initial classification model, and obtain a label corresponding to each corpus data in the remaining unlabeled corpus samples; the supplement subunit 8062 is configured to obtain the supplement training sample according to a label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and a problem included in each piece of corpus data.
Specifically, after the initial classification model is obtained, the labeling subunit 8061 uses the problem included in each piece of corpus data of the remaining unlabeled corpus sample as the input of the initial classification model, so as to obtain the label corresponding to each piece of corpus data of the remaining unlabeled corpus sample.
The supplementing subunit 8062 may obtain a supplemented training sample according to a question included in each piece of corpus data of the remaining unlabeled corpus sample and a label corresponding to each piece of corpus data, where each piece of training sample data of the supplemented training sample includes a question and a corresponding label, each piece of corpus data of the remaining unlabeled corpus sample corresponds to one piece of training sample data of the supplemented training sample, and the corpus data of the remaining unlabeled corpus sample includes the same question as the corresponding training sample data.
On the basis of the above embodiments, further, the coding-decoding framework-based model includes a sequence-to-sequence model.
In particular, the model based on the coding-decoding framework adopts a sequence-to-sequence model, and the Seq2Seq model is suitable for machine translation, text summarization, conversation and other scenes. The embodiment of the invention adopts the Seq2Seq model to train to obtain the pre-training model, can better express the relation between sentences, and the Encoder layer of the pre-training model can better express the sentences. The Seq2Seq model can adopt LSTM or GRU algorithm to realize the Encoder layer and the Decoder layer, and can also adopt Transformer algorithm to realize the Encoder-Decoder layer. The GRU algorithm is faster than the LSTM algorithm in training speed, and the Transformer algorithm is better in model effect due to the fact that an attention mechanism is added.
Fig. 12 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in fig. 12, the electronic device may include: a processor 1201, a communication interface 1202, a memory 1203, and a communication bus 1204, where the processor 1201, the communication interface 1202, and the memory 1203 communicate with each other through the communication bus 1204. The processor 1201 may call logic instructions in the memory 1203 to perform the following method: obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question; inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer.
In addition, the logic instructions in the memory 1203 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example, comprising: obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question; inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer.
The present embodiment provides a computer-readable storage medium storing a computer program that causes a computer to execute the method provided by the above method embodiments, for example, comprising: obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question; inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In the description herein, reference to the description of the terms "one embodiment," "a particular embodiment," "some embodiments," "for example," "an example," "a particular example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (8)

1. A method for classifying unlabeled corpora, characterized by comprising the following steps:
obtaining an unlabeled corpus, wherein the unlabeled corpus comprises at least one question;
inputting each question included in the unlabeled corpus into a text classification model, and outputting a label corresponding to each question; wherein the text classification model is obtained by training based on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer;
wherein the step of obtaining the text classification model by training based on the unlabeled corpus samples comprises:
training a model of an encoding-decoding framework based on the unlabeled corpus samples to obtain a pre-training model, wherein the pre-training model comprises an encoding layer;
sampling the unlabeled corpus samples to obtain first training samples, wherein each piece of training sample data in the first training samples comprises a question and a corresponding label;
training a first training model based on the first training samples to obtain an initial classification model; wherein the first training model comprises a classification layer and the encoding layer of the pre-training model, and parameters of the encoding layer are kept unchanged during the training of the first training model;
obtaining supplementary training samples based on remaining unlabeled corpus samples and the initial classification model; wherein the remaining unlabeled corpus samples are obtained by removing the unlabeled corpus samples corresponding to the first training samples from the unlabeled corpus samples;
training the initial classification model based on second training samples to obtain the text classification model; wherein the second training samples comprise the first training samples and the supplementary training samples, and the parameters of the encoding layer of the initial classification model are kept unchanged during the training of the initial classification model;
wherein sampling the unlabeled corpus samples to obtain the first training samples comprises:
clustering the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories;
sampling the unlabeled corpus samples of each preset category to obtain original samples;
and obtaining the first training samples according to the labeled original samples.
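By way of illustration only, the clustering-and-sampling step recited at the end of claim 1 can be sketched as follows. This is a minimal sketch assuming TF-IDF features, KMeans clustering, and scikit-learn; the patent does not prescribe a particular feature representation or clustering algorithm, and all function and variable names here are illustrative.

```python
# Minimal sketch of claim 1's sampling step: cluster unlabeled questions into
# preset categories, draw a few questions from each cluster, and hand these to
# annotators; the (question, label) pairs they produce form the first training
# samples. TF-IDF + KMeans are assumptions, not requirements of the patent.
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def sample_for_labeling(questions, n_clusters=3, per_cluster=1, seed=0):
    features = TfidfVectorizer().fit_transform(questions)   # one row per question
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=seed).fit_predict(features)
    rng = random.Random(seed)
    original_samples = []
    for c in range(n_clusters):
        members = [q for q, cid in zip(questions, cluster_ids) if cid == c]
        # sampling per cluster keeps the initial labeled set diverse
        original_samples.extend(rng.sample(members, min(per_cluster, len(members))))
    return original_samples

questions = [
    "How do I reset my password?", "I forgot my login password",
    "What is the daily transfer limit?", "How much can I transfer per day?",
    "Where can I check my balance?", "Show me my account balance",
]
print(sample_for_labeling(questions))  # e.g. one question drawn per cluster
```

Sampling per cluster rather than uniformly over the whole corpus is what keeps the small hand-labeled set representative of every preset category.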
2. The method of claim 1, wherein obtaining the supplementary training samples based on the remaining unlabeled corpus samples and the initial classification model comprises:
labeling the remaining unlabeled corpus samples through the initial classification model to obtain a label corresponding to each piece of corpus data in the remaining unlabeled corpus samples;
and obtaining the supplementary training samples according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and the question included in each piece of corpus data.
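As a rough illustration of claims 1 and 2 together, the sketch below trains an initial classifier on the hand-labeled first training samples, lets it label the remaining questions, and merges the two sets into the second training samples. A TF-IDF plus logistic-regression pipeline stands in for the pre-trained encoding layer plus classification layer; the toy data and all names are assumptions, not from the patent.

```python
# Minimal sketch of the pseudo-labeling loop: the initial classification model
# labels the remaining unlabeled questions, and those machine-labeled pairs
# become the supplementary training samples of claim 2.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

first_samples = [                      # hand-labeled (question, label) pairs
    ("How do I reset my password?", "account"),
    ("What is the daily transfer limit?", "transfer"),
    ("Where can I check my balance?", "balance"),
]
remaining = [                          # questions still without labels
    "I forgot my login password",
    "How much can I transfer per day?",
    "Show me my account balance",
]

initial_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
initial_model.fit([q for q, _ in first_samples], [y for _, y in first_samples])

# Machine-labeled (question, label) pairs = supplementary training samples.
supplementary_samples = list(zip(remaining, initial_model.predict(remaining)))
second_samples = first_samples + supplementary_samples  # input to the final training pass
print(second_samples)
```

In practice one might keep only high-confidence predictions (e.g. via predict_proba) as supplementary samples, although the claim itself labels all remaining corpus data.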
3. The method according to claim 1 or 2, characterized in that the model of the encoding-decoding framework is a sequence-to-sequence model.
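Claim 3 pins the pre-training model down to a sequence-to-sequence architecture. The following PyTorch sketch shows one way the pieces could fit together: a GRU encoder-decoder pre-trained to map questions to answers, whose encoder is then frozen and reused under a classification layer, as claims 1 and 2 require. Layer sizes, the GRU choice, and all names are assumptions; the patent fixes only the seq2seq framework, not a specific architecture.

```python
# Minimal sketch: seq2seq pre-training (question -> answer), then a classifier
# that reuses the encoder with frozen parameters. Illustrative only.
import torch
import torch.nn as nn

VOCAB, EMB, HID, N_CLASSES = 1000, 32, 64, 5

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HID, batch_first=True)
    def forward(self, x):                 # x: (batch, seq_len) token ids
        _, h = self.gru(self.emb(x))
        return h[-1]                      # (batch, HID) sentence encoding

class Seq2Seq(nn.Module):                 # pre-training model: question -> answer
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.emb = nn.Embedding(VOCAB, EMB)
        self.gru = nn.GRU(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)
    def forward(self, q, a_in):
        h = self.encoder(q).unsqueeze(0)  # init decoder with the question encoding
        y, _ = self.gru(self.emb(a_in), h)
        return self.out(y)                # (batch, seq_len, VOCAB) logits

class Classifier(nn.Module):              # first training model: frozen encoder + classification layer
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False       # claim 1: encoding-layer parameters stay fixed
        self.cls = nn.Linear(HID, N_CLASSES)
    def forward(self, q):
        return self.cls(self.encoder(q))

encoder = Encoder()
q = torch.randint(0, VOCAB, (2, 7))       # toy batch: 2 questions, 7 tokens each
a = torch.randint(0, VOCAB, (2, 9))
print(Seq2Seq(encoder)(q, a).shape)       # torch.Size([2, 9, 1000])
print(Classifier(encoder)(q).shape)       # torch.Size([2, 5])
```

Freezing the encoder means only the small classification layer is fitted on the scarce labeled data, which is the point of pre-training on the plentiful question-answer pairs first.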
4. A device for classifying unlabeled corpora, characterized by comprising:
an obtaining unit, configured to obtain an unlabeled corpus, wherein the unlabeled corpus comprises at least one question;
a classification unit, configured to input each question included in the unlabeled corpus into a text classification model and output a label corresponding to each question; wherein the text classification model is obtained by training based on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples comprises a question and an answer;
a pre-training unit, configured to train a model of an encoding-decoding framework based on the unlabeled corpus samples to obtain a pre-training model, wherein the pre-training model comprises an encoding layer;
a sample obtaining unit, configured to sample the unlabeled corpus samples to obtain first training samples, wherein each piece of training sample data in the first training samples comprises a question and a corresponding label;
an initial training unit, configured to train a first training model based on the first training samples to obtain an initial classification model; wherein the first training model comprises a classification layer and the encoding layer of the pre-training model, and parameters of the encoding layer are kept unchanged during the training of the first training model;
a sample supplementing unit, configured to obtain supplementary training samples based on remaining unlabeled corpus samples and the initial classification model; wherein the remaining unlabeled corpus samples are obtained by removing the unlabeled corpus samples corresponding to the first training samples from the unlabeled corpus samples;
a model establishing unit, configured to train the initial classification model based on second training samples to obtain the text classification model; wherein the second training samples comprise the first training samples and the supplementary training samples, and the parameters of the encoding layer of the initial classification model are kept unchanged during the training of the initial classification model;
wherein the sample obtaining unit comprises:
a clustering subunit, configured to cluster the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories;
a sampling subunit, configured to sample the unlabeled corpus samples of each preset category to obtain original samples;
and an obtaining subunit, configured to obtain the first training samples according to the labeled original samples.
5. The device according to claim 4, wherein the sample supplementing unit comprises:
a labeling subunit, configured to label the remaining unlabeled corpus samples through the initial classification model to obtain a label corresponding to each piece of corpus data in the remaining unlabeled corpus samples;
and a supplementing subunit, configured to obtain the supplementary training samples according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and the question included in each piece of corpus data.
6. The device according to claim 4 or 5, characterized in that the model of the encoding-decoding framework is a sequence-to-sequence model.
7. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 3.
8. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN201910602361.6A 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora Active CN110297909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910602361.6A CN110297909B (en) 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora

Publications (2)

Publication Number Publication Date
CN110297909A CN110297909A (en) 2019-10-01
CN110297909B true CN110297909B (en) 2021-07-02

Family

ID=68030372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910602361.6A Active CN110297909B (en) 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora

Country Status (1)

Country Link
CN (1) CN110297909B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888968A (en) * 2019-10-15 2020-03-17 Advanced Institute of Information Technology, Peking University (Zhejiang) Customer service dialogue intention classification method and device, electronic equipment and medium
CN111125365B (en) * 2019-12-24 2022-01-07 Jingdong Technology Holding Co., Ltd. Address data labeling method and device, electronic equipment and storage medium
CN111506732B (en) * 2020-04-20 2023-05-26 Beijing Zhongke Fanyu Technology Co., Ltd. Text multi-level label classification method
CN111554270B (en) * 2020-04-29 2023-04-18 Beijing SoundAI Technology Co., Ltd. Training sample screening method and electronic equipment
CN111626063B (en) * 2020-07-28 2020-12-08 Zhejiang University Text intention identification method and system based on projection gradient descent and label smoothing

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368182B * 2016-08-19 2020-02-18 Beijing SenseTime Technology Development Co., Ltd. Gesture detection network training, gesture detection and gesture control method and device
CN108319599B * 2017-01-17 2021-02-26 Huawei Technologies Co., Ltd. Man-machine conversation method and device
US20180329884A1 * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN107818080A * 2017-09-22 2018-03-20 NewTranx Information Technology (Beijing) Co., Ltd. Term recognition method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679734A (en) * 2017-09-27 2018-02-09 Chengdu Sefon Software Co., Ltd. Method and system for classification prediction of unlabeled data
CN108509596A (en) * 2018-04-02 2018-09-07 Guangzhou Shendi Computer System Co., Ltd. Text classification method, device, computer equipment and storage medium
CN109308316A (en) * 2018-07-25 2019-02-05 South China University of Technology Adaptive dialog generation system based on topic clustering
CN109189901A (en) * 2018-08-09 2019-01-11 Beijing Zhongguancun Kejin Technology Co., Ltd. Method for automatically discovering new categories and corresponding corpora in an intelligent customer service system
CN109446302A (en) * 2018-09-25 2019-03-08 Ping An Life Insurance Company of China, Ltd. Question and answer data processing method, device and computer equipment based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Google open-sources seq2seq, a general-purpose encoder & decoder framework; redditnote; https://blog.csdn.net/redditnote/article/details/102589845; 2017-03-16; p. 1 *
Unsupervised Pretraining for Sequence to Sequence Learning; Prajit Ramachandran et al.; Under review as a conference paper at ICLR 2017; 2017-12-31; pp. 1-18 *

Also Published As

Publication number Publication date
CN110297909A (en) 2019-10-01

Similar Documents

Publication Publication Date Title
CN110297909B (en) Method and device for classifying unlabeled corpora
CN107657017B (en) Method and apparatus for providing voice service
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
CN112100349A (en) Multi-turn dialogue method and device, electronic equipment and storage medium
CN111883115B (en) Voice flow quality inspection method and device
CN113032545B (en) Method and system for conversation understanding and answer configuration based on unsupervised conversation pre-training
CN108256066B (en) End-to-end hierarchical decoding task type dialogue system
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN110599324A (en) Method and device for predicting refund rate
CN111581970B (en) Text recognition method, device and storage medium for network context
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114911932A (en) Heterogeneous graph structure multi-conversation person emotion analysis method based on theme semantic enhancement
CN114495904B (en) Speech recognition method and device
CN114937465A (en) Speech emotion recognition method based on self-supervision learning and computer equipment
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
CN114386426A (en) Gold medal speaking skill recommendation method and device based on multivariate semantic fusion
CN110795531B (en) Intention identification method, device and storage medium
CN117349427A (en) Artificial intelligence multi-mode content generation system for public opinion event coping
CN116795970A (en) Dialog generation method and application thereof in emotion accompanying
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN116361442A (en) Business hall data analysis method and system based on artificial intelligence
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN112434143B (en) Dialog method, storage medium and system based on hidden state constraint of GRU (generalized regression Unit)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant