CN110297909A - Classification method and device for unlabeled corpus - Google Patents

Classification method and device for unlabeled corpus

Info

Publication number
CN110297909A
CN110297909A (application number CN201910602361.6A)
Authority
CN
China
Prior art keywords
sample
label
training
corpus
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910602361.6A
Other languages
Chinese (zh)
Other versions
CN110297909B (en)
Inventor
刘华杰
李晓萍
张宏韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN201910602361.6A priority Critical patent/CN110297909B/en
Publication of CN110297909A publication Critical patent/CN110297909A/en
Application granted granted Critical
Publication of CN110297909B publication Critical patent/CN110297909B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a classification method and device for unlabeled corpus. The method comprises: obtaining an unlabeled corpus, the unlabeled corpus including at least one question; and inputting each question included in the unlabeled corpus into a text classification model, which outputs the label corresponding to each question. The text classification model is obtained by training on unlabeled corpus samples, in which each piece of corpus data includes one question and one answer. The device is configured to execute the above method. The classification method and device for unlabeled corpus provided by the embodiments of the present invention improve the accuracy of classifying unlabeled corpus.

Description

Classification method and device for unlabeled corpus
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a classification method and device for unlabeled corpus.
Background art
With the development of artificial intelligence technology, dialogue robots based on artificial intelligence have been widely used in fields such as customer service, outbound calling, sales, and intelligent search. Intent recognition, as the core technology of a dialogue robot system, directly determines the accuracy of the dialogue and the user experience.
Currently, the most effective intent recognition technique is the deep learning model: a deep learning model obtained through training can classify unlabeled corpus and thereby support intent recognition. However, training a deep learning model requires collecting a large number of annotated samples, which is very time-consuming and labor-intensive; accumulating a large amount of annotated data (i.e., sample data) takes a long time, and large amounts of high-quality annotated data are very expensive. Moreover, deep learning models have a very large number of parameters, so when sample data is scarce they easily "overfit" and become very sensitive to noisy data. To mitigate the overfitting caused by scarce sample data, the prior art selects simple models, applies techniques such as penalty terms, and on the data-processing side uses techniques such as denoising and sample augmentation. Even so, it remains difficult to solve the problem that too little sample data yields an insufficiently accurate deep learning model, so the classification accuracy on unlabeled corpus is very low, which hinders the application of deep learning models.
Summary of the invention
In view of the problems in the prior art, embodiments of the present invention provide a classification method and device for unlabeled corpus, which can at least partially solve those problems.
In one aspect, the present invention proposes a classification method for unlabeled corpus, comprising:
obtaining an unlabeled corpus, the unlabeled corpus including at least one question;
inputting each question included in the unlabeled corpus into a text classification model, and outputting the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
In another aspect, the present invention provides a classification device for unlabeled corpus, comprising:
an acquiring unit, configured to obtain an unlabeled corpus, the unlabeled corpus including at least one question;
a classification unit, configured to input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
In a further aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the text classification model establishment method based on unlabeled corpus described in any of the above embodiments are implemented.
With the classification method and device for unlabeled corpus provided by the embodiments of the present invention, an unlabeled corpus can be obtained, each question included in the unlabeled corpus can be input into a text classification model obtained by training on unlabeled corpus samples, and the label corresponding to each question can be output. In cases where high-quality annotated samples are hard to obtain or insufficient in quantity, unlabeled corpus can be classified by the text classification model obtained from training on unlabeled corpus samples, which improves the accuracy of classifying unlabeled corpus.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort. In the drawings:
Fig. 1 is a flow diagram of the classification method for unlabeled corpus provided by an embodiment of the present invention.
Fig. 2 is a flow diagram of the classification method for unlabeled corpus provided by yet another embodiment of the present invention.
Fig. 3 is a structural schematic diagram of the first training model provided by an embodiment of the present invention.
Fig. 4 is a flow diagram of the classification method for unlabeled corpus provided by another embodiment of the present invention.
Fig. 5 is a flow diagram of the classification method for unlabeled corpus provided by a further embodiment of the present invention.
Fig. 6 is a structural schematic diagram of the Seq2Seq model provided by an embodiment of the present invention.
Fig. 7 is a structural schematic diagram of the text classification model establishment system based on unlabeled corpus provided by an embodiment of the present invention.
Fig. 8 is a structural schematic diagram of the classification device for unlabeled corpus provided by an embodiment of the present invention.
Fig. 9 is a structural schematic diagram of the classification device for unlabeled corpus provided by another embodiment of the present invention.
Fig. 10 is a structural schematic diagram of the classification device for unlabeled corpus provided by yet another embodiment of the present invention.
Fig. 11 is a structural schematic diagram of the classification device for unlabeled corpus provided by a further embodiment of the present invention.
Fig. 12 is a physical structure schematic diagram of the electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the drawings. The exemplary embodiments of the present invention and their descriptions here are used to explain the present invention, not to limit it.
Fig. 1 is a flow diagram of the classification method for unlabeled corpus provided by an embodiment of the present invention. As shown in Fig. 1, the classification method for unlabeled corpus provided by this embodiment comprises:
S101, obtaining an unlabeled corpus, the unlabeled corpus including at least one question;
Specifically, whether customers are served by a human agent or by a dialogue robot, voice dialogues are generated in the process. The voice dialogues can be transcribed into text by speech recognition technology, and the questions in the text can be collected as the unlabeled corpus. The unlabeled corpus is text data that has not been classified, and it includes at least one question.
S102, inputting each question included in the unlabeled corpus into a text classification model and outputting the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
Specifically, after the unlabeled corpus is obtained, each question in the unlabeled corpus is used as the input of the text classification model. Through the processing of the text classification model, the label corresponding to each question can be output; the label identifies the type to which the question belongs. The text classification model is obtained by training on unlabeled corpus samples; each piece of corpus data in the unlabeled corpus samples includes one question and one answer, the answer being the reply to the question, which can also be regarded as annotating the question. The number of pieces of corpus data included in the unlabeled corpus samples is set according to actual needs, and the embodiments of the present invention do not limit it. The specific training process of the text classification model is described below and is not repeated here. The execution subject of the embodiments of the present invention includes, but is not limited to, a computer.
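As a minimal illustration of step S102 (not part of the original disclosure), the following Python sketch classifies a batch of questions with an already-trained Keras text classification model; the tokenizer, the `id2label` map, and `max_len` are assumed names introduced for this example:

```python
# Sketch of step S102: feed each question of the unlabeled corpus to the
# trained text classification model and return the corresponding labels.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def classify_questions(questions, tokenizer, text_classifier, id2label, max_len=32):
    """Return the predicted label for each question in the unlabeled corpus."""
    seqs = tokenizer.texts_to_sequences(questions)        # text -> token ids
    padded = pad_sequences(seqs, maxlen=max_len, padding="post")
    probs = text_classifier.predict(padded)               # Softmax scores per label
    return [id2label[int(i)] for i in np.argmax(probs, axis=1)]
```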
With the classification method for unlabeled corpus provided by this embodiment, an unlabeled corpus can be obtained, each question included in the unlabeled corpus can be input into a text classification model obtained by training on unlabeled corpus samples, and the label corresponding to each question can be output. In cases where high-quality annotated samples are hard to obtain or insufficient in quantity, unlabeled corpus can be classified by the text classification model obtained from training on unlabeled corpus samples, which improves the accuracy of classifying unlabeled corpus.
Fig. 2 is a flow diagram of the classification method for unlabeled corpus provided by yet another embodiment of the present invention. As shown in Fig. 2, the step of obtaining the text classification model by training on unlabeled corpus samples includes:
S201, training an encoder-decoder framework model on the unlabeled corpus samples to obtain a pre-training model, the pre-training model including an encoding layer;
Specifically, the unlabeled corpus samples are input into an encoder-decoder framework model, and the encoder-decoder framework model is trained to obtain a pre-training model that includes an encoding layer. Encoding means converting the input sequence into a vector of fixed length; decoding means converting that fixed-length vector back into an output sequence. In the training process of the encoder-decoder framework model, the question in each piece of corpus data of the unlabeled corpus samples corresponds to the encoding, and the answer in each piece corresponds to the decoding, so a semantic encoding with good representation and generalization ability can be obtained. The Encoder-Decoder framework is a model framework in deep learning; models of the Encoder-Decoder framework include, but are not limited to, the sequence-to-sequence (Sequence to Sequence, abbreviated Seq2Seq) model.
For example, the unlabeled corpus samples can be obtained from saved customer service recordings: speech recognition technology (Automatic Speech Recognition, abbreviated ASR) transcribes the recordings into text offline, yielding the original corpus; then the dialogue scenarios in the original corpus are manually proofread to obtain the unlabeled corpus samples. The proofreading includes, but is not limited to, error correction and sentence alignment. Each piece of corpus data in the resulting unlabeled corpus samples includes one question and one answer. For example: Question: Hello, may I ask what the bank's housing loan interest rate is? Answer: Hello, the current interest rate is 5.6%. Or: Question: How much is my credit card limit now? Answer: Hello, your current limit is 50,000 RMB.
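As an illustrative sketch (an assumption about data layout, not part of the patent text), the unlabeled corpus sample can be held as question-answer pairs; the GO/EOS markers anticipate the Seq2Seq decoder input described with Fig. 6 below:

```python
# Each piece of corpus data is one (question, answer) pair; in pre-training
# the question feeds the Encoder and the answer feeds the Decoder (step S201).
unlabeled_corpus_sample = [
    ("Hello, may I ask what the bank's housing loan interest rate is?",
     "Hello, the current interest rate is 5.6%."),
    ("How much is my credit card limit now?",
     "Hello, your current limit is 50,000 RMB."),
]
encoder_texts = [question for question, _ in unlabeled_corpus_sample]
decoder_texts = ["<GO> " + answer + " <EOS>" for _, answer in unlabeled_corpus_sample]
```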
S202, sampling the unlabeled corpus samples to obtain a first training sample, each piece of training sample data in the first training sample including one question and a corresponding label;
Specifically, after the unlabeled corpus samples are obtained, they are sampled to obtain the first training sample. Each piece of training sample data in the first training sample includes one question and the label corresponding to that question. The label identifies the type to which the question belongs and is preset. Each piece of training sample data corresponds to a piece of corpus data in the unlabeled corpus samples, and the question included in the training sample data is identical to the question included in the corresponding piece of corpus data. It should be noted that steps S202 and S201 have no required order.
For example, first the unlabeled corpus samples are clustered to obtain unlabeled corpus samples of preset categories; then each category of unlabeled corpus samples is sampled to obtain an original sample; the original sample is then manually annotated, with each piece of corpus data in the original sample classified, to obtain the label corresponding to each piece. The first training sample is obtained from the question included in each piece of corpus data of the annotated original sample together with the corresponding label. The preset categories are set according to actual needs, and the embodiments of the present invention do not limit them; the ratio or number by which each category of unlabeled corpus samples is sampled is likewise set according to actual needs, without limitation here.
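A minimal sketch of this clustering-and-sampling step follows (the TF-IDF features and scikit-learn K-means are one assumed realization; the patent itself only names LDA or K-means as candidate algorithms):

```python
# Sketch of steps S2021-S2022: cluster the unlabeled questions, then draw a
# small sample from every cluster for manual annotation (step S2023).
import random
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def sample_for_annotation(questions, n_clusters=10, per_cluster=20, seed=0):
    features = TfidfVectorizer().fit_transform(questions)
    clusters = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(features)
    rng = random.Random(seed)
    original_sample = []
    for c in range(n_clusters):
        members = [q for q, k in zip(questions, clusters) if k == c]
        original_sample += rng.sample(members, min(per_cluster, len(members)))
    return original_sample  # to be labeled manually into the first training sample
```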
S203, training a first training model on the first training sample to obtain an initial classification model; wherein the first training model includes a classification layer and the encoding layer of the pre-training model, and the parameters of the encoding layer remain unchanged during the training of the first training model;
Specifically, a first training model is established; the first training model includes the encoding (Encoder) layer of the pre-training model and a classification layer, the output of the Encoder layer serving as the input of the classification layer, and the classification layer can use the Softmax algorithm. The question included in each piece of training sample data of the first training sample is used as the input of the first training model, the label included in each piece is used as its output, and the parameters of the Encoder layer are kept constant while the first training model is trained, yielding an initial classification model that includes the Encoder layer. During the training of the first training model, the Encoder layer parameters are kept constant and training obtains the parameters of the classification layer. In this way the relationships among the unlabeled corpus samples are exploited while the classification layer is trained with supervision to complete the text classification task, so an initial classification model with high generalization can be obtained from relatively few training samples. Since the first training model reuses the encoding layer of the pre-training model, the number of pieces of training sample data needed in the first training sample can be reduced, which saves the cost and time of obtaining the first training sample, improves the training efficiency of the first training model, and hence improves the efficiency of obtaining the text classification model.
For example, Fig. 3 is a structural schematic diagram of the first training model provided by an embodiment of the present invention. As shown in Fig. 3, the first training model includes a Word Embedding layer, an Encoder layer, and a Softmax layer. The question included in each piece of training sample data of the first training sample, for example "Hello, may I ask if you are XXX?", is input into the Word Embedding layer, which converts each word in the question into a term vector of fixed length and outputs it to the Encoder layer. The Encoder layer processes the input term vectors and outputs the state variable C to the Softmax layer as its initial value. After training, the Softmax layer can output the label corresponding to the question; for example, for the input question "Hello, may I ask if you are XXX?", the corresponding label "label_ask_whether_me" is output.
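A sketch of the first training model in Keras is given below; it assumes the pre-trained Word Embedding and Encoder layers are available as Keras layers built with `return_state=True`, and the layer names and sizes are illustrative, not taken from the patent:

```python
# Sketch of Fig. 3: reuse and freeze the pre-trained Embedding and Encoder
# layers; only the Softmax classification layer is trained (step S203).
from tensorflow.keras import layers, Model

def build_first_training_model(pretrained_embedding, pretrained_encoder,
                               n_labels, max_len=32):
    pretrained_embedding.trainable = False   # keep Encoder-side parameters fixed
    pretrained_encoder.trainable = False
    question_ids = layers.Input(shape=(max_len,), dtype="int32")
    vectors = pretrained_embedding(question_ids)       # words -> fixed-length vectors
    _, state_h, _ = pretrained_encoder(vectors)        # final LSTM state = variable C
    label_probs = layers.Dense(n_labels, activation="softmax")(state_h)
    model = Model(question_ids, label_probs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model
```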
S204, obtaining a supplementary training sample based on the remaining unlabeled corpus samples and the initial classification model; wherein the remaining unlabeled corpus samples are obtained by removing, from the unlabeled corpus samples, the unlabeled corpus samples corresponding to the first training sample;
Specifically, after the initial classification model is obtained, it can be used to annotate the remaining unlabeled corpus samples and obtain the label corresponding to each piece of their corpus data: the question included in each piece of corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, which yields the corresponding label. Then, from the question included in each piece of corpus data of the remaining unlabeled corpus samples and the corresponding label, a supplementary training sample can be obtained, in which each piece of training sample data includes one question and the corresponding label. The remaining unlabeled corpus samples are obtained by removing from the unlabeled corpus samples those corresponding to the first training sample, each piece of training sample data of the first training sample corresponding to one piece of corpus data of the unlabeled corpus samples.
S205, training the initial classification model on a second training sample to obtain the text classification model; wherein the second training sample includes the first training sample and the supplementary training sample, and during the training of the initial classification model the parameters of the encoding layer included in the initial classification model remain unchanged.
Specifically, the supplementary training sample and the first training sample are combined as the second training sample, each piece of whose training sample data includes one question and the corresponding label. The question included in each piece of training sample data of the second training sample is used as the input of the initial classification model, the label included in each piece is used as its output, and the parameters of the Encoder layer of the initial classification model are kept constant during training; training the initial classification model in this way yields the text classification model.
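Steps S204-S205 can be sketched as the pseudo-labeling loop below (the function and variable names are assumptions; the encoder layers stay frozen because they were marked non-trainable when the initial model was built):

```python
# Sketch of steps S204-S205: label the remaining unlabeled questions with the
# initial classification model, merge them with the first training sample,
# and retrain the classification layer to obtain the text classification model.
import numpy as np

def retrain_on_second_sample(initial_model, first_x, first_y, remaining_x, epochs=3):
    pseudo_y = np.argmax(initial_model.predict(remaining_x), axis=1)  # supplementary labels
    second_x = np.concatenate([first_x, remaining_x])   # second training sample
    second_y = np.concatenate([first_y, pseudo_y])
    initial_model.fit(second_x, second_y, epochs=epochs)  # encoder remains frozen
    return initial_model  # the final text classification model
```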
With the classification method for unlabeled corpus provided by this embodiment, a text classification model can be established without a large number of labeled training samples, improving the accuracy of text classification model establishment. Since unlabeled corpus samples are very convenient to obtain, the classification method for unlabeled corpus provided by the embodiments of the present invention can still establish a text classification model and avoid the overfitting problem in cases where high-quality annotated samples are hard to obtain or insufficient in quantity.
Fig. 4 is a flow diagram of the classification method for unlabeled corpus provided by another embodiment of the present invention. As shown in Fig. 4, further to the above embodiments, sampling the unlabeled corpus samples to obtain the first training sample includes:
S2021, clustering the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories;
Specifically, the unlabeled corpus samples can be clustered using a clustering algorithm to obtain unlabeled corpus samples of preset categories, i.e., each preset category corresponds to a certain amount of the corpus data of the unlabeled corpus samples, and each piece of corpus data of the unlabeled corpus samples corresponds to a preset category. The preset categories are set based on practical experience, and the embodiments of the present invention do not limit them; the clustering algorithm can use LDA (Latent Dirichlet Allocation) or the K-means clustering algorithm. It should be noted that during clustering the preset categories give only the number of classes; the label corresponding to each preset category is not yet known.
S2022, sampling the unlabeled corpus samples of each preset category to obtain an original sample;
Specifically, in order to reduce the amount of corpus data of the unlabeled corpus samples that must be annotated, the unlabeled corpus samples of each preset category are sampled, i.e., a certain number of corpus samples are taken from the unlabeled corpus samples of each preset category to obtain the original sample. The ratio or number sampled is set according to actual needs, and the embodiments of the present invention do not limit it.
S2023, obtaining the first training sample from the annotated original sample.
Specifically, after the original sample is obtained, each piece of its corpus data can be manually annotated with its preset category according to its semantics, yielding the label corresponding to each piece of corpus data of the original sample and thus the annotated original sample. The first training sample is obtained from the question included in each piece of corpus data of the annotated original sample together with the corresponding label; each piece of training sample data of the first training sample includes one question and the corresponding label.
Fig. 5 is a flow diagram of the classification method for unlabeled corpus provided by a further embodiment of the present invention. As shown in Fig. 5, further to the above embodiments, obtaining the supplementary training sample based on the remaining unlabeled corpus samples and the initial classification model includes:
S2041, annotating the remaining unlabeled corpus samples with the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples;
Specifically, after the initial classification model is obtained, the question included in each piece of corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, which yields the label corresponding to each piece of corpus data of the remaining unlabeled corpus samples.
S2042, obtaining the supplementary training sample from the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and the question included in each piece of corpus data.
Specifically, from the question included in each piece of corpus data of the remaining unlabeled corpus samples and the corresponding label, the supplementary training sample can be obtained. Each piece of training sample data of the supplementary training sample includes one question and the corresponding label; each piece of corpus data of the remaining unlabeled corpus samples corresponds to one piece of training sample data of the supplementary training sample, and the two include the same question.
Further to the above embodiments, the encoder-decoder framework model uses a sequence-to-sequence model.
Specifically, the encoder-decoder framework model uses a sequence-to-sequence model; the Seq2Seq model is applicable to scenarios such as machine translation, text summarization, and dialogue. The embodiments of the present invention use Seq2Seq model training to obtain the pre-training model, which can better express the relationship between sentences, and the Encoder layer of the pre-training model can better express sentences. The Seq2Seq model can implement the Encoder and Decoder layers with the Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) algorithm, or implement the Encoder-Decoder layers with the Transformer algorithm. The GRU algorithm trains faster than the LSTM algorithm, and the Transformer algorithm adds an attention mechanism, giving a better model effect.
A Seq2Seq model implemented with LSTM is illustrated below. Fig. 6 is a structural schematic diagram of the Seq2Seq model provided by an embodiment of the present invention. As shown in Fig. 6, the Seq2Seq model includes two Word Embedding layers, an Encoder layer, and a Decoder layer. The Word Embedding layers convert each word of an input sentence into a term vector of fixed length; the Encoder and Decoder layers are both implemented with LSTM. The question included in each piece of corpus data of the unlabeled corpus samples, for example "Hello, may I ask if you are XXX?", is input into Word Embedding layer 1, which converts each word in the question into a term vector of fixed length and outputs it to the Encoder layer; the Encoder layer processes the input term vectors and outputs the state variable C to the Decoder layer as its initial value. The answer included in each piece of corpus data of the unlabeled corpus samples, for example "Yes, I am", is input into Word Embedding layer 2, which converts each word in the answer into a term vector of fixed length and outputs it to the Decoder layer. By learning from the state variable C, the Decoder layer can finally output the answer corresponding to the question. GO and EOS are special characters: GO marks the start of the answer and EOS marks its end.
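For concreteness, a minimal Keras sketch of the Fig. 6 architecture follows (the vocabulary size and dimensions are illustrative assumptions; the patent specifies only the layer arrangement):

```python
# Sketch of the Seq2Seq pre-training model of Fig. 6: two Word Embedding
# layers and an LSTM Encoder whose final state C initializes the LSTM Decoder.
from tensorflow.keras import layers, Model

VOCAB, EMB, HID = 20000, 128, 256

enc_in = layers.Input(shape=(None,), dtype="int32")        # question token ids
dec_in = layers.Input(shape=(None,), dtype="int32")        # answer ids, starting with <GO>
enc_emb = layers.Embedding(VOCAB, EMB)(enc_in)             # Word Embedding layer 1
_, h, c = layers.LSTM(HID, return_state=True)(enc_emb)     # Encoder; [h, c] is state C
dec_emb = layers.Embedding(VOCAB, EMB)(dec_in)             # Word Embedding layer 2
dec_out = layers.LSTM(HID, return_sequences=True)(dec_emb, initial_state=[h, c])
next_word = layers.Dense(VOCAB, activation="softmax")(dec_out)

seq2seq = Model([enc_in, dec_in], next_word)
seq2seq.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Targets are the answer sequences shifted by one step, ending with <EOS>.
```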
Fig. 7 is a structural schematic diagram of the text classification model establishment system based on unlabeled corpus provided by an embodiment of the present invention. As shown in Fig. 7, the system can be used to establish the text classification model and comprises a dialogue corpus generating device 1, a sample generating device 2, a model training device 3, a sample storage device 4, and a model storage device 5, wherein:
The dialogue corpus generating device 1 is used to acquire and generate the unlabeled corpus samples. It includes an offline speech transcription unit 101 and a proofreading unit 102. The offline speech transcription unit 101 transcribes recorded dialogue corpus into text, and the proofreading unit 102 proofreads the transcribed text to generate the unlabeled corpus samples; the proofreading unit 102 can send the generated unlabeled corpus samples to the sample storage device 4 for storage. The proofreading includes, but is not limited to, dialogue alignment and manual error correction. Examples of the unlabeled dialogue samples obtained by the dialogue corpus generating device 1:
Question: Hello, may I ask what the bank's housing loan interest rate is? Answer: Hello, the current interest rate is xx.
Question: How much is my credit card limit now? Answer: Hello, your current limit is xxx.
The sample generating device 2 is used to generate the first training sample from the unlabeled corpus samples. It includes a sample clustering unit 201, a sample sampling unit 202, and a sample annotating unit 203. The sample clustering unit 201 clusters the unlabeled corpus samples with a clustering algorithm to obtain unlabeled corpus samples of preset categories; it can get the unlabeled corpus samples from the sample storage device 4, and the clustering algorithm can be LDA or K-means. The sample sampling unit 202 samples the unlabeled corpus samples of each preset category to obtain the original sample. The sample annotating unit 203 obtains the first training sample from the annotated original sample; the original sample can be annotated manually, with a label assigned to each preset category. The sample annotating unit 203 can send the first training sample to the sample storage device 4 for storage, in preparation for the subsequent initial model training. Examples of the first training sample are as follows:
label_business_inquiry Hello, may I ask what the bank's housing loan interest rate is?
label_business_inquiry How much is my credit card limit now?
label_handle_credit_card I want to apply for a credit card.
Here label_business_inquiry and label_handle_credit_card are the labels, set based on practical experience; the sentence following each label is a question corresponding to that label.
The model training device 3 is used to train on the training data and obtain the corresponding models. It includes a pre-training model unit 301, an initial model training unit 302, and a final model training unit 303. The pre-training model unit 301 can obtain the unlabeled corpus samples from the sample storage device 4 and performs model training on them with an algorithm of the Encoder-Decoder framework to obtain the pre-training model. The Encoder-Decoder framework algorithm can use the Seq2Seq algorithm, which is commonly used in scenarios such as machine translation, text summarization, and dialogue. The embodiments of the present invention use the Seq2Seq algorithm as the pre-training model algorithm; the pre-training model obtained by training can better express the relationship between sentences, and its Encoder layer can better express sentences.
The initial model training unit 302 trains, on the first training sample, a first training model that includes the Encoder layer of the pre-training model and a classification layer, obtaining the initial classification model. During training the Encoder layer parameters are fixed, and the parameters of the classification layer are trained. In this way the relationships among the pieces of corpus data of the unlabeled corpus samples are exploited while the classifier is trained with supervision to complete the text classification task; the method can obtain an initial classification model with high generalization from relatively few training samples. After the initial classification model is obtained, the remaining unlabeled corpus samples can be annotated with it, finally yielding the supplementary training sample; the first training sample and the supplementary training sample are combined to obtain the second training sample, which has more training data than the first training sample. The final model training unit 303 trains the initial classification model on the second training sample to obtain the text classification model; during this training the Encoder layer parameters of the initial classification model are fixed, and the parameters of its classification layer are trained.
The sample storage device 4 is used to store sample data and includes an unlabeled sample storage unit 401 and a sample data storage unit 402. The unlabeled sample storage unit 401 stores the unlabeled corpus samples; the sample data storage unit 402 stores the first training sample and the second training sample.
The model storage device 5 is used to store models and includes a pre-training model storage unit 501, an initial classification model storage unit 502, and a final model storage unit 503, which store the pre-training model, the initial classification model, and the text classification model, respectively.
After the text classification model is applied in practice, more corpus data can be obtained over time, so the unlabeled corpus samples, the first training sample, and the second training sample can be expanded. The expanded unlabeled corpus samples can be used to retrain the encoder-decoder framework model and update the pre-training model; after the pre-training model is updated, the first training model can be updated accordingly. The expanded first training sample can be used to retrain the updated first training model and update the initial classification model, and the expanded second training sample can be used to retrain the updated initial classification model and update the text classification model. A mechanism of continuous iterative updating is thus established: the text classification model can be continuously optimized and updated, further improving its accuracy.
Fig. 8 is a structural schematic diagram of the classification device for unlabeled corpus provided by an embodiment of the present invention. As shown in Fig. 8, the classification device for unlabeled corpus provided by this embodiment includes an acquiring unit 801 and a classification unit 802, wherein:
the acquiring unit 801 is configured to obtain an unlabeled corpus, the unlabeled corpus including at least one question; the classification unit 802 is configured to input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
Specifically, whether customers are served by a human agent or by a dialogue robot, voice dialogues are generated in the process. The voice dialogues can be transcribed into text by speech recognition technology, and the questions in the text can be collected as the unlabeled corpus; the unlabeled corpus is text data that has not been classified. The acquiring unit 801 can obtain the unlabeled corpus, which includes at least one question.
After the unlabeled corpus is obtained, the classification unit 802 uses each question in the unlabeled corpus as the input of the text classification model; through the processing of the text classification model, the label corresponding to each question can be output, the label identifying the type to which the question belongs. The text classification model is obtained by training on unlabeled corpus samples; each piece of corpus data in the unlabeled corpus samples includes one question and one answer, the answer being the reply to the question, which can also be regarded as annotating the question. The number of pieces of corpus data included in the unlabeled corpus samples is set according to actual needs, and the embodiments of the present invention do not limit it. The specific training process of the text classification model is described below and is not repeated here.
With the classification device for unlabeled corpus provided by this embodiment, an unlabeled corpus can be obtained, each question included in the unlabeled corpus can be input into a text classification model obtained by training on unlabeled corpus samples, and the label corresponding to each question can be output. In cases where high-quality annotated samples are hard to obtain or insufficient in quantity, unlabeled corpus can be classified by the text classification model obtained from training on unlabeled corpus samples, which improves the accuracy of classifying unlabeled corpus.
Fig. 9 is a structural schematic diagram of the classification device for unlabeled corpus provided by another embodiment of the present invention. As shown in Fig. 9, the classification device for unlabeled corpus provided by this embodiment further includes a pre-training unit 803, a sample obtaining unit 804, an initial training unit 805, a sample supplementing unit 806, and a model establishing unit 807, wherein:
the pre-training unit 803 is configured to train an encoder-decoder framework model on the unlabeled corpus samples to obtain a pre-training model, the pre-training model including an encoding layer; the sample obtaining unit 804 is configured to sample the unlabeled corpus samples to obtain a first training sample, each piece of training sample data in the first training sample including one question and a corresponding label; the initial training unit 805 is configured to train a first training model on the first training sample to obtain an initial classification model, wherein the first training model includes a classification layer and the encoding layer of the pre-training model, and the parameters of the encoding layer remain unchanged during the training of the first training model; the sample supplementing unit 806 is configured to obtain a supplementary training sample based on the remaining unlabeled corpus samples and the initial classification model, wherein the remaining unlabeled corpus samples are obtained by removing, from the unlabeled corpus samples, the unlabeled corpus samples corresponding to the first training sample; the model establishing unit 807 is configured to train the initial classification model on a second training sample to obtain the text classification model, wherein the second training sample includes the first training sample and the supplementary training sample, and the parameters of the encoding layer of the initial classification model remain unchanged during its training.
Specifically, the pre-training unit 803 inputs the unlabeled corpus samples into an encoder-decoder framework model and trains it to obtain a pre-training model that includes an encoding layer. Encoding means converting the input sequence into a vector of fixed length; decoding means converting that fixed-length vector back into an output sequence. In the training process of the encoder-decoder framework model, the question included in each piece of corpus data of the unlabeled corpus samples corresponds to the encoding and the answer corresponds to the decoding, so a semantic encoding with good representation and generalization ability can be obtained. The Encoder-Decoder framework is a model framework in deep learning; models of the Encoder-Decoder framework include, but are not limited to, the sequence-to-sequence model.
After the unlabeled corpus samples are obtained, the sample obtaining unit 804 samples them to obtain the first training sample; each piece of training sample data in the first training sample includes one question and the label corresponding to that question. The label identifies the type to which the question belongs and is preset. Each piece of training sample data corresponds to a piece of corpus data in the unlabeled corpus samples, and the question included in the training sample data is identical to the question included in the corresponding piece of corpus data.
Specifically, the initial training unit 805 establishes a first training model that includes the encoding layer of the pre-training model and a classification layer, the output of the Encoder layer serving as the input of the classification layer; the classification layer can use the Softmax algorithm. The initial training unit 805 uses the question included in each piece of training sample data of the first training sample as the input of the first training model and the label included in each piece as its output, keeps the parameters of the Encoder layer constant, and trains the first training model to obtain an initial classification model that includes the Encoder layer. During the training of the first training model, the Encoder layer parameters are kept constant and training obtains the parameters of the classification layer; in this way the relationships among the unlabeled corpus samples are exploited while the classification layer is trained with supervision to complete the text classification task, and an initial classification model with high generalization can be obtained from relatively few training samples.
After the initial classification model is obtained, the sample supplementing unit 806 can use it to annotate the remaining unlabeled corpus samples and obtain the label corresponding to each piece of their corpus data: the question included in each piece of corpus data of the remaining unlabeled corpus samples is used as the input of the initial classification model, which yields the corresponding label. The sample supplementing unit 806 then obtains the supplementary training sample from the question included in each piece of corpus data of the remaining unlabeled corpus samples and the corresponding label; each piece of training sample data of the supplementary training sample includes one question and the corresponding label. The remaining unlabeled corpus samples are obtained by removing from the unlabeled corpus samples those corresponding to the first training sample, each piece of training sample data of the first training sample corresponding to one piece of corpus data of the unlabeled corpus samples.
The model establishing unit 807 combines the supplementary training sample and the first training sample as the second training sample, each piece of whose training sample data includes one question and the corresponding label. The model establishing unit 807 uses the question included in each piece of training sample data of the second training sample as the input of the initial classification model and the label included in each piece as its output, keeps the parameters of the Encoder layer of the initial classification model constant during training, and trains the initial classification model to obtain the text classification model.
With the classification device for unlabeled corpus provided by this embodiment, a text classification model can be established without a large number of labeled training samples, improving the accuracy of text classification model establishment. Since unlabeled corpus samples are very convenient to obtain, the classification device for unlabeled corpus provided by the embodiments of the present invention can still establish a text classification model and avoid the overfitting problem in cases where high-quality annotated samples are hard to obtain or insufficient in quantity.
Fig. 10 is a structural schematic diagram of the classification device for unlabeled corpus provided by yet another embodiment of the present invention. As shown in Fig. 10, the sample obtaining unit 804 includes a clustering subunit 8041, a sampling subunit 8042, and an obtaining subunit 8043, wherein:
the clustering subunit 8041 is configured to cluster the unlabeled corpus samples to obtain unlabeled corpus samples of preset categories; the sampling subunit 8042 is configured to sample the unlabeled corpus samples of each preset category to obtain an original sample; the obtaining subunit 8043 is configured to obtain the first training sample from the annotated original sample.
Specifically, the clustering subunit 8041 can cluster the unlabeled corpus samples with a clustering algorithm to obtain unlabeled corpus samples of preset categories, i.e., each preset category corresponds to a certain amount of the corpus data of the unlabeled corpus samples, and each piece of corpus data of the unlabeled corpus samples corresponds to a preset category. The preset categories are set based on practical experience, and the embodiments of the present invention do not limit them; the clustering algorithm can use LDA (Latent Dirichlet Allocation) or the K-means clustering algorithm. It should be noted that during clustering the preset categories give only the number of classes; the label corresponding to each preset category is not yet known.
In order to reduce the amount of corpus data of the unlabeled corpus samples that must be annotated, the sampling subunit 8042 samples the unlabeled corpus samples of each preset category, i.e., takes a certain number of corpus samples from the unlabeled corpus samples of each preset category to obtain the original sample. The ratio or number sampled is set according to actual needs, and the embodiments of the present invention do not limit it.
After the original sample is obtained, each piece of its corpus data can be manually annotated with its preset category according to its semantics, yielding the label corresponding to each piece of corpus data of the original sample and thus the annotated original sample. The obtaining subunit 8043 obtains the first training sample from the question included in each piece of corpus data of the annotated original sample together with the corresponding label; each piece of training sample data of the first training sample includes one question and the corresponding label.
Fig. 11 is a structural schematic diagram of the classification device for unlabeled corpus provided by a further embodiment of the present invention. As shown in Fig. 11, further to the above embodiments, the sample supplementing unit 806 includes an annotating subunit 8061 and a supplementing subunit 8062, wherein:
the annotating subunit 8061 is configured to annotate the remaining unlabeled corpus samples with the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples; the supplementing subunit 8062 is configured to obtain the supplementary training sample from the label corresponding to each piece of corpus data in the remaining unlabeled corpus samples and the question included in each piece of corpus data.
Specifically, after the initial classification model is obtained, the annotating subunit 8061 uses the question included in each piece of corpus data of the remaining unlabeled corpus samples as the input of the initial classification model, which yields the label corresponding to each piece of corpus data of the remaining unlabeled corpus samples.
From the question included in each piece of corpus data of the remaining unlabeled corpus samples and the corresponding label, the supplementing subunit 8062 can obtain the supplementary training sample. Each piece of training sample data of the supplementary training sample includes one question and the corresponding label; each piece of corpus data of the remaining unlabeled corpus samples corresponds to one piece of training sample data of the supplementary training sample, and the two include the same question.
Further to the above embodiments, the encoder-decoder framework model includes a sequence-to-sequence model.
Specifically, the encoder-decoder framework model uses a sequence-to-sequence model; the Seq2Seq model is applicable to scenarios such as machine translation, text summarization, and dialogue. The embodiments of the present invention use Seq2Seq model training to obtain the pre-training model, which can better express the relationship between sentences, and the Encoder layer of the pre-training model can better express sentences. The Seq2Seq model can implement the Encoder and Decoder layers with the LSTM or GRU algorithm, or implement the Encoder-Decoder layers with the Transformer algorithm. The GRU algorithm trains faster than the LSTM algorithm, and the Transformer algorithm adds an attention mechanism, giving a better model effect.
Fig. 12 is a physical structure schematic diagram of the electronic device provided by an embodiment of the present invention. As shown in Fig. 12, the electronic device may include a processor (processor) 1201, a communications interface (Communications Interface) 1202, a memory (memory) 1203, and a communication bus 1204, wherein the processor 1201, the communications interface 1202, and the memory 1203 communicate with each other through the communication bus 1204. The processor 1201 can call the logic instructions in the memory 1203 to execute the following method: obtaining an unlabeled corpus, the unlabeled corpus including at least one question; inputting each question included in the unlabeled corpus into a text classification model and outputting the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
In addition, the logic instructions in the memory 1203 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
This embodiment discloses a computer program product; the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions which, when executed by a computer, enable the computer to carry out the methods provided by the above method embodiments, for example: obtaining an unlabeled corpus, the unlabeled corpus including at least one question; inputting each question included in the unlabeled corpus into a text classification model and outputting the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
This embodiment provides a computer-readable storage medium storing a computer program; the computer program causes the computer to execute the methods provided by the above method embodiments, for example: obtaining an unlabeled corpus, the unlabeled corpus including at least one question; inputting each question included in the unlabeled corpus into a text classification model and outputting the label corresponding to each question; wherein the text classification model is obtained by training on unlabeled corpus samples, and each piece of corpus data in the unlabeled corpus samples includes one question and one answer.
It should be understood by those skilled in the art that, the embodiment of the present invention can provide as method, system or computer program Product.Therefore, complete hardware embodiment, complete software embodiment or reality combining software and hardware aspects can be used in the present invention Apply the form of example.Moreover, it wherein includes the computer of computer usable program code that the present invention, which can be used in one or more, The computer program implemented in usable storage medium (including but not limited to magnetic disk storage, CD-ROM, optical memory etc.) produces The form of product.
The present invention be referring to according to the method for the embodiment of the present invention, the process of equipment (system) and computer program product Figure and/or block diagram describe.It should be understood that every one stream in flowchart and/or the block diagram can be realized by computer program instructions The combination of process and/or box in journey and/or box and flowchart and/or the block diagram.It can provide these computer programs Instruct the processor of general purpose computer, special purpose computer, Embedded Processor or other programmable data processing devices to produce A raw machine, so that being generated by the instruction that computer or the processor of other programmable data processing devices execute for real The device for the function of being specified in present one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In the description of this specification, reference to the terms "one embodiment", "specific embodiment", "some embodiments", "for example", "example", "specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The specific embodiments described above further illustrate the objectives, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (12)

1. A method for classifying an unlabeled corpus, characterized by comprising:
obtaining an unlabeled corpus, the unlabeled corpus including at least one question;
inputting each question included in the unlabeled corpus into a text classification model, and outputting the label corresponding to each question; wherein the text classification model is obtained by training on an unlabeled corpus sample, and each piece of corpus data in the unlabeled corpus sample includes one question and one answer.
2. The method according to claim 1, wherein the step of obtaining the text classification model by training on the unlabeled corpus sample comprises:
training an encoder-decoder framework model on the unlabeled corpus sample to obtain a pre-trained model, the pre-trained model including an encoding layer;
sampling the unlabeled corpus sample to obtain a first training sample, each piece of training sample data in the first training sample including one question and a corresponding label;
training a first training model on the first training sample to obtain an initial classification model; wherein the first training model includes a classification layer and the encoding layer of the pre-trained model, and the parameters of the encoding layer remain unchanged during the training of the first training model;
obtaining a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model; wherein the remaining unlabeled corpus sample is obtained by removing, from the unlabeled corpus sample, the unlabeled corpus sample corresponding to the first training sample;
training the initial classification model on a second training sample to obtain the text classification model; wherein the second training sample includes the first training sample and the supplementary training sample, and the parameters of the encoding layer of the initial classification model remain unchanged during the training of the initial classification model.
3. The method according to claim 2, wherein the sampling the unlabeled corpus sample to obtain the first training sample comprises:
clustering the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories;
sampling the unlabeled corpus sample of each preset category to obtain an original sample;
obtaining the first training sample according to the annotated original sample.
4. The method according to claim 2, wherein the obtaining a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model comprises:
annotating the remaining unlabeled corpus sample with the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample;
obtaining the supplementary training sample according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample and the question included in each piece of corpus data.
5. The method according to any one of claims 1 to 4, wherein the encoder-decoder framework model is a sequence-to-sequence model.
6. A device for classifying an unlabeled corpus, characterized by comprising:
an acquiring unit, configured to obtain an unlabeled corpus, the unlabeled corpus including at least one question;
a classification unit, configured to input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained by training on an unlabeled corpus sample, and each piece of corpus data in the unlabeled corpus sample includes one question and one answer.
7. The device according to claim 6, further comprising:
a pre-training unit, configured to train an encoder-decoder framework model on the unlabeled corpus sample to obtain a pre-trained model, the pre-trained model including an encoding layer;
a sample obtaining unit, configured to sample the unlabeled corpus sample to obtain a first training sample, each piece of training sample data in the first training sample including one question and a corresponding label;
an initial training unit, configured to train a first training model on the first training sample to obtain an initial classification model; wherein the first training model includes a classification layer and the encoding layer of the pre-trained model, and the parameters of the encoding layer remain unchanged during the training of the first training model;
a sample supplementing unit, configured to obtain a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model; wherein the remaining unlabeled corpus sample is obtained by removing, from the unlabeled corpus sample, the unlabeled corpus sample corresponding to the first training sample;
a model establishing unit, configured to train the initial classification model on a second training sample to obtain the text classification model; wherein the second training sample includes the first training sample and the supplementary training sample, and the parameters of the encoding layer of the initial classification model remain unchanged during the training of the initial classification model.
8. The device according to claim 7, wherein the sample obtaining unit comprises:
a clustering subunit, configured to cluster the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories;
a sampling subunit, configured to sample the unlabeled corpus sample of each preset category to obtain an original sample;
an obtaining subunit, configured to obtain the first training sample according to the annotated original sample.
9. The device according to claim 7, wherein the sample supplementing unit comprises:
an annotating subunit, configured to annotate the remaining unlabeled corpus sample with the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample;
a supplementing subunit, configured to obtain the supplementary training sample according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample and the question included in each piece of corpus data.
10. The device according to any one of claims 6 to 9, wherein the encoder-decoder framework model is a sequence-to-sequence model.
11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 5.
12. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
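To make the training procedure recited in claims 2 to 5 concrete, the following is a minimal end-to-end sketch in Python, under loud assumptions: PyTorch and scikit-learn are assumed as the tooling, a toy hash tokenizer and toy corpus stand in for real data, a bag-of-words prediction head stands in for the full sequence-to-sequence decoder of claim 5, and every name, architecture, and hyperparameter below is illustrative rather than disclosed by the patent.

import torch
import torch.nn as nn
from sklearn.cluster import KMeans

V, E, H, PAD = 1000, 32, 64, 0  # vocab size, embedding dim, hidden dim, pad id

def ids(text):
    # Toy hash tokenizer, padded/truncated to length 8.
    t = [1 + hash(w) % (V - 1) for w in text.split()][:8]
    return torch.tensor(t + [PAD] * (8 - len(t)))

# Unlabeled corpus sample: each piece of corpus data is one question and one answer.
corpus = [("how to reset password", "tap forgot password"),
          ("card annual fee amount", "the fee is ten dollars"),
          ("transfer limit per day", "the daily limit is 5000")] * 4

class Encoder(nn.Module):
    # The encoding layer of the encoder-decoder framework model.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(V, E, padding_idx=PAD)
        self.gru = nn.GRU(E, H, batch_first=True)
    def forward(self, x):
        _, s = self.gru(self.emb(x))
        return s[-1]  # (batch, H) sentence encoding

encoder = Encoder()
decoder = nn.Linear(H, V)  # simplified stand-in for a seq2seq decoder

# Step 1: pre-train the encoder-decoder model on (question, answer) pairs.
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for q, a in corpus:
    logits = decoder(encoder(ids(q).unsqueeze(0)))
    target = torch.zeros(1, V).scatter_(1, ids(a).unsqueeze(0), 1.0)  # answer bag-of-words
    loss = nn.functional.binary_cross_entropy_with_logits(logits, target)
    opt.zero_grad(); loss.backward(); opt.step()

# Step 2: cluster the questions into preset categories and sample one question
# per cluster as the first training sample; the cluster ids below play the
# role of the manual annotation a human would supply.
with torch.no_grad():
    feats = torch.stack([encoder(ids(q).unsqueeze(0))[0] for q, _ in corpus])
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(feats.numpy())
first_idx = [list(clusters).index(c) for c in range(3)]
manual_label = {i: c for i, c in zip(first_idx, range(3))}

# Step 3: train the classification layer on the first training sample while
# the parameters of the pre-trained encoding layer remain unchanged.
encoder.requires_grad_(False)
classify = nn.Linear(H, 3)
opt = torch.optim.Adam(classify.parameters(), lr=1e-2)
def train_step(i, label):
    logits = classify(encoder(ids(corpus[i][0]).unsqueeze(0)))
    loss = nn.functional.cross_entropy(logits, torch.tensor([label]))
    opt.zero_grad(); loss.backward(); opt.step()
for _ in range(50):
    for i, label in manual_label.items():
        train_step(i, label)

# Step 4: annotate the remaining unlabeled samples with the initial
# classification model to obtain the supplementary training sample.
rest = [i for i in range(len(corpus)) if i not in manual_label]
with torch.no_grad():
    pseudo = {i: int(classify(encoder(ids(corpus[i][0]).unsqueeze(0))).argmax())
              for i in rest}

# Step 5: retrain on the first plus supplementary samples, encoder still frozen.
for _ in range(20):
    for i, label in {**manual_label, **pseudo}.items():
        train_step(i, label)
print("pseudo-labels for the remaining samples:", pseudo)

Keeping the encoding layer frozen in both classification passes (steps 3 and 5) is the point of the claimed scheme: the representation learned from the plentiful question-answer pairs is preserved, while the scarce manually annotated samples and the model-annotated supplementary samples only fit the lightweight classification layer.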
CN201910602361.6A 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora Active CN110297909B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910602361.6A CN110297909B (en) 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910602361.6A CN110297909B (en) 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora

Publications (2)

Publication Number Publication Date
CN110297909A true CN110297909A (en) 2019-10-01
CN110297909B CN110297909B (en) 2021-07-02

Family

ID=68030372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910602361.6A Active CN110297909B (en) 2019-07-05 2019-07-05 Method and device for classifying unlabeled corpora

Country Status (1)

Country Link
CN (1) CN110297909B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368182A (en) * 2016-08-19 2017-11-21 北京市商汤科技开发有限公司 Gestures detection network training, gestures detection, gestural control method and device
CN108319599A (en) * 2017-01-17 2018-07-24 华为技术有限公司 A kind of interactive method and apparatus
US20180329884A1 (en) * 2017-05-12 2018-11-15 Rsvp Technologies Inc. Neural contextual conversation learning
CN107818080A (en) * 2017-09-22 2018-03-20 新译信息科技(北京)有限公司 Term recognition methods and device
CN107679734A (en) * 2017-09-27 2018-02-09 成都四方伟业软件股份有限公司 It is a kind of to be used for the method and system without label data classification prediction
CN108509596A (en) * 2018-04-02 2018-09-07 广州市申迪计算机系统有限公司 File classification method, device, computer equipment and storage medium
CN109308316A (en) * 2018-07-25 2019-02-05 华南理工大学 A kind of adaptive dialog generation system based on Subject Clustering
CN109189901A (en) * 2018-08-09 2019-01-11 北京中关村科金技术有限公司 Automatically a kind of method of the new classification of discovery and corresponding corpus in intelligent customer service system
CN109446302A (en) * 2018-09-25 2019-03-08 中国平安人寿保险股份有限公司 Question and answer data processing method, device and computer equipment based on machine learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PRAJIT RAMACHANDRAN et al.: "Unsupervised Pretraining for Sequence to Sequence Learning", Under Review as a Conference Paper at ICLR 2017 *
REDDITNOTE: "Google open-sources seq2seq, a general encoder & decoder framework", https://blog.csdn.net/redditnote/article/details/102589845 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888968A (en) * 2019-10-15 2020-03-17 浙江省北大信息技术高等研究院 Customer service dialogue intention classification method and device, electronic equipment and medium
CN111125365A (en) * 2019-12-24 2020-05-08 京东数字科技控股有限公司 Address data labeling method and device, electronic equipment and storage medium
CN111198948A (en) * 2020-01-08 2020-05-26 深圳前海微众银行股份有限公司 Text classification correction method, device and equipment and computer readable storage medium
CN111506732A (en) * 2020-04-20 2020-08-07 北京中科凡语科技有限公司 Text multi-level label classification method
CN111506732B (en) * 2020-04-20 2023-05-26 北京中科凡语科技有限公司 Text multi-level label classification method
CN111554270A (en) * 2020-04-29 2020-08-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111554270B (en) * 2020-04-29 2023-04-18 北京声智科技有限公司 Training sample screening method and electronic equipment
CN111626063A (en) * 2020-07-28 2020-09-04 浙江大学 Text intention identification method and system based on projection gradient descent and label smoothing
CN111626063B (en) * 2020-07-28 2020-12-08 浙江大学 Text intention identification method and system based on projection gradient descent and label smoothing

Also Published As

Publication number Publication date
CN110297909B (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN110297909A (en) A kind of classification method and device of no label corpus
CN110377911B (en) Method and device for identifying intention under dialog framework
US10909328B2 (en) Sentiment adapted communication
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN107680579A (en) Text regularization model training method and device, text regularization method and device
CN107657017A (en) Method and apparatus for providing voice service
CN107886949A (en) A kind of content recommendation method and device
CN113268610B (en) Intent jump method, device, equipment and storage medium based on knowledge graph
CN109408800A (en) Talk with robot system and associative skills configuration method
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN110704618B (en) Method and device for determining standard problem corresponding to dialogue data
CN113450759A (en) Voice generation method, device, electronic equipment and storage medium
CN110225210A (en) Based on call abstract Auto-writing work order method and system
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN110232914A (en) A kind of method for recognizing semantics, device and relevant device
CN115455982A (en) Dialogue processing method, dialogue processing device, electronic equipment and storage medium
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN114003700A (en) Method and system for processing session information, electronic device and storage medium
CN107734123A (en) A kind of contact sequencing method and device
CN111046674B (en) Semantic understanding method and device, electronic equipment and storage medium
CN113486174A (en) Model training, reading understanding method and device, electronic equipment and storage medium
CN110782221A (en) Intelligent interview evaluation system and method
CN115934891A (en) Question understanding method and device
CN114118068B (en) Method and device for amplifying training text data and electronic equipment

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant