CN110297909A - Classification method and device for an unlabeled corpus - Google Patents
Classification method and device for an unlabeled corpus
- Publication number
- CN110297909A CN110297909A CN201910602361.6A CN201910602361A CN110297909A CN 110297909 A CN110297909 A CN 110297909A CN 201910602361 A CN201910602361 A CN 201910602361A CN 110297909 A CN110297909 A CN 110297909A
- Authority
- CN
- China
- Prior art keywords
- sample
- label
- training
- corpus
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a classification method and device for an unlabeled corpus. The method comprises: obtaining an unlabeled corpus, the unlabeled corpus comprising at least one question; and inputting each question in the unlabeled corpus into a text classification model, which outputs the label corresponding to each question. The text classification model is obtained by training on an unlabeled corpus sample in which each piece of corpus data comprises one question and one answer. The device is configured to execute the above method. The classification method and device for an unlabeled corpus provided by the embodiments of the present invention improve the accuracy of classifying an unlabeled corpus.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a classification method and device for an unlabeled corpus.
Background technique
With the development of artificial intelligence technology, dialogue robots based on artificial intelligence have been widely applied in fields such as customer service, outbound calling, sales, and intelligent search. Intent recognition, as the core technology of a dialogue robot system, directly determines the accuracy of the dialogue and the user experience.
At present, the most effective intent-recognition technique is the deep learning model: a trained deep learning model can classify an unlabeled corpus and thereby support intent recognition. However, training a deep learning model requires collecting a large number of labeled samples, which is very time-consuming and labor-intensive; accumulating a large amount of labeled data (i.e., sample data) takes a long time, and high-quality labeled data in large quantities is very expensive. Moreover, deep learning models have so many parameters that they easily overfit when sample data is scarce, making them very sensitive to noisy data. To mitigate the overfitting caused by insufficient sample data, the prior art selects simple models and applies techniques such as penalty terms, and on the data-processing side applies denoising, sample augmentation, and similar techniques; nevertheless, it remains difficult to overcome the inaccuracy of a deep learning model trained on too little sample data. As a result, the accuracy of unlabeled-corpus classification is very low, which limits the application of deep learning models.
Summary of the invention
In view of the problems in the prior art, embodiments of the present invention provide a classification method and device for an unlabeled corpus that can at least partially solve those problems.
In one aspect, the present invention proposes a classification method for an unlabeled corpus, comprising:
obtaining an unlabeled corpus, the unlabeled corpus comprising at least one question; and
inputting each question in the unlabeled corpus into a text classification model, which outputs the label corresponding to each question; wherein the text classification model is obtained by training on an unlabeled corpus sample, and each piece of corpus data in the unlabeled corpus sample comprises one question and one answer.
In another aspect, the present invention provides a classification device for an unlabeled corpus, comprising:
an acquiring unit, configured to obtain an unlabeled corpus, the unlabeled corpus comprising at least one question; and
a classification unit, configured to input each question in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained by training on an unlabeled corpus sample, and each piece of corpus data in the unlabeled corpus sample comprises one question and one answer.
In yet another aspect, the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the method for building a text classification model based on an unlabeled corpus described in any of the above embodiments.
With the classification method and device for an unlabeled corpus provided by the embodiments of the present invention, an unlabeled corpus can be obtained, each question in the unlabeled corpus can be input into a text classification model trained on an unlabeled corpus sample, and the label corresponding to each question can be output. When high-quality labeled samples are hard to obtain or insufficient in quantity, the unlabeled corpus can still be classified by a text classification model trained on an unlabeled corpus sample, which improves the accuracy of classifying the unlabeled corpus.
Detailed description of the invention
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the accompanying drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort. In the drawings:
Fig. 1 is a flow diagram of a classification method for an unlabeled corpus provided by an embodiment of the present invention.
Fig. 2 is a flow diagram of a classification method for an unlabeled corpus provided by yet another embodiment of the present invention.
Fig. 3 is a structural diagram of a first training model provided by an embodiment of the present invention.
Fig. 4 is a flow diagram of a classification method for an unlabeled corpus provided by another embodiment of the present invention.
Fig. 5 is a flow diagram of a classification method for an unlabeled corpus provided by a further embodiment of the present invention.
Fig. 6 is a structural diagram of a Seq2Seq model provided by an embodiment of the present invention.
Fig. 7 is a structural diagram of a system for building a text classification model based on an unlabeled corpus, provided by an embodiment of the present invention.
Fig. 8 is a structural diagram of a classification device for an unlabeled corpus provided by an embodiment of the present invention.
Fig. 9 is a structural diagram of a classification device for an unlabeled corpus provided by another embodiment of the present invention.
Fig. 10 is a structural diagram of a classification device for an unlabeled corpus provided by yet another embodiment of the present invention.
Fig. 11 is a structural diagram of a classification device for an unlabeled corpus provided by a further embodiment of the present invention.
Fig. 12 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings. The exemplary embodiments and their descriptions are intended to explain, not to limit, the present invention.
Fig. 1 is a flow diagram of a classification method for an unlabeled corpus provided by an embodiment of the present invention. As shown in Fig. 1, the classification method comprises:
S101: obtain an unlabeled corpus, the unlabeled corpus comprising at least one question.
Specifically, whether customer service is handled by a human agent or by a dialogue robot, voice dialogues are generated in the process. These voice dialogues can be transcribed into text by speech recognition technology, and the questions in the resulting text can be collected as the unlabeled corpus. The unlabeled corpus is text data that has not been classified, and it comprises at least one question.
S102: input each question in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained by training on an unlabeled corpus sample, and each piece of corpus data in the unlabeled corpus sample comprises one question and one answer.
Specifically, after the unlabeled corpus is obtained, each question in it is used as the input of the text classification model; after processing by the model, the label corresponding to each question is output. The label identifies the type to which the question belongs. The text classification model is obtained by training on an unlabeled corpus sample; each piece of corpus data in the sample comprises one question and one answer, the answer being a reply to the question, so the answer can be regarded as annotating the question. The quantity of corpus data in the unlabeled corpus sample is configured according to actual needs, which is not limited by the embodiments of the present invention. The specific training process of the text classification model is described below and is not repeated here. The executing subject of the embodiments of the present invention includes, but is not limited to, a computer.
With the classification method for an unlabeled corpus provided by the embodiments of the present invention, an unlabeled corpus can be obtained, each question in the unlabeled corpus can be input into a text classification model trained on an unlabeled corpus sample, and the label corresponding to each question can be output. When high-quality labeled samples are hard to obtain or insufficient in quantity, the unlabeled corpus can still be classified by a text classification model trained on an unlabeled corpus sample, which improves the accuracy of classifying the unlabeled corpus.
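The inference flow of S101-S102 can be sketched as follows. This is a minimal stand-in: `TextClassifier` here is a hypothetical keyword-based placeholder for the trained text classification model described below, and the label names are illustrative, not from the patent.

```python
# Minimal sketch of S101/S102: map each question of an unlabeled corpus
# to a label.  `TextClassifier` is a hypothetical keyword-based stand-in
# for the trained text classification model.

class TextClassifier:
    def __init__(self, keyword_to_label):
        self.keyword_to_label = keyword_to_label

    def predict(self, question):
        # Return the label of the first matching keyword.
        for keyword, label in self.keyword_to_label.items():
            if keyword in question:
                return label
        return "label_unknown"

def classify_corpus(questions, model):
    # S102: each question is mapped to its corresponding label.
    return {q: model.predict(q) for q in questions}

model = TextClassifier({"interest rate": "label_query_business",
                        "credit card": "label_handle_credit_card"})
corpus = ["What is the housing loan interest rate?",
          "I want to apply for a credit card."]
labels = classify_corpus(corpus, model)
```

In the patent's design, `predict` would instead run the frozen-encoder network trained in steps S201-S205.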
Fig. 2 is a flow diagram of a classification method for an unlabeled corpus provided by yet another embodiment of the present invention. As shown in Fig. 2, the step of training the text classification model on the unlabeled corpus sample comprises:
S201: train an encoder-decoder framework model on the unlabeled corpus sample to obtain a pre-trained model, the pre-trained model comprising an encoding layer.
Specifically, the unlabeled corpus sample is input into an encoder-decoder framework model, and the model is trained to obtain a pre-trained model comprising an encoding layer. Encoding transforms the input sequence into a vector of fixed length; decoding then converts that fixed-length vector into the output sequence. During the training of the encoder-decoder framework model, the question in each piece of corpus data of the unlabeled corpus sample corresponds to the encoding, and the answer in each piece of corpus data corresponds to the decoding, so a semantic encoding with good representational and generalization ability can be obtained. The encoder-decoder framework is a model framework in deep learning; models of this framework include, but are not limited to, the sequence-to-sequence (Seq2Seq) model.
For example, the unlabeled corpus sample can be obtained from saved customer-service recordings: the recordings are transcribed offline into text by automatic speech recognition (ASR) to obtain the original corpus; then, for each dialogue scenario in the original corpus, manual proofreading is performed to obtain the unlabeled corpus sample. Proofreading includes, but is not limited to, error correction and sentence alignment. Each piece of corpus data in the resulting unlabeled corpus sample comprises one question and one answer, for example — Question: Hello, may I ask what the bank's housing loan interest rate is? Answer: Hello, the current interest rate is 5.6%. Or — Question: What is my credit card limit now? Answer: Hello, your current limit is 50,000 RMB.
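Question/answer pairs like those above become the encoder inputs and decoder targets of the Seq2Seq pre-training in S201. A minimal sketch of that data preparation, assuming whitespace tokenization and the GO/EOS special tokens described with Fig. 6 (the function name and tokenization are illustrative):

```python
GO, EOS = "<GO>", "<EOS>"

def make_seq2seq_pair(question, answer):
    # Encoder input: the tokenized question.
    encoder_input = question.split()
    answer_tokens = answer.split()
    # Decoder input starts with GO (for teacher forcing);
    # decoder target ends with EOS, marking the end of the answer.
    decoder_input = [GO] + answer_tokens
    decoder_target = answer_tokens + [EOS]
    return encoder_input, decoder_input, decoder_target

enc, dec_in, dec_out = make_seq2seq_pair(
    "what is the housing loan interest rate",
    "the current interest rate is 5.6%")
```

Every piece of corpus data is turned into such a triple before the encoder-decoder model is trained on it.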
S202: sample the unlabeled corpus sample to obtain a first training sample, each piece of training sample data in the first training sample comprising one question and a corresponding label.
Specifically, after the unlabeled corpus sample is obtained, it is sampled to obtain the first training sample. Each piece of training sample data in the first training sample comprises one question and the label corresponding to that question. The label identifies the type to which the question belongs and is preset. Each piece of training sample data corresponds to one piece of corpus data in the unlabeled corpus sample, and the question in the training sample data is identical to the question in the corresponding corpus data. It should be understood that steps S201 and S202 have no required order.
For example, first, the unlabeled corpus sample is clustered to obtain unlabeled corpus samples of preset categories; then, the unlabeled corpus sample of each category is sampled to obtain an original sample; next, the original sample is manually annotated — each piece of corpus data in the original sample is classified to obtain its corresponding label; finally, the first training sample is obtained from the question in each annotated piece of corpus data and that piece's corresponding label. The preset categories are configured according to actual needs, and the ratio or quantity sampled from the unlabeled corpus sample of each category is likewise configured according to actual needs; neither is limited by the embodiments of the present invention.
S203: train a first training model on the first training sample to obtain an initial classification model; wherein the first training model comprises a classification layer and the encoding layer of the pre-trained model, and the parameters of the encoding layer remain unchanged during the training of the first training model.
Specifically, a first training model is built that comprises the encoder (Encoder) layer of the pre-trained model and a classification layer; the output of the Encoder layer serves as the input of the classification layer, and the classification layer may use the Softmax algorithm. The question in each piece of training sample data of the first training sample is used as the input of the first training model, and the label in each piece of training sample data is used as its output; with the parameters of the Encoder layer kept constant, the first training model is trained, yielding an initial classification model that includes the Encoder layer. Keeping the Encoder parameters constant during training and learning only the parameters of the classification layer both exploits the relationships within the unlabeled corpus sample and trains the classification layer with supervision to complete the text classification task; an initial classification model with high generalization can thus be obtained from fewer training samples. Because the first training model reuses the encoding layer of the pre-trained model, the quantity of training sample data needed in the first training sample is reduced, which saves the cost and time of obtaining the first training sample, improves the training efficiency of the first training model, and thereby improves the efficiency of obtaining the text classification model.
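The frozen-encoder scheme of S203 can be sketched as follows — a pure-Python stand-in in which a fixed `encode` function plays the role of the frozen pre-trained Encoder layer and only the classification layer's weight matrix would be trainable. The feature dimensions, the `encode` function, and the hand-set weights are all illustrative assumptions, not from the patent.

```python
import math

def encode(tokens):
    # Stand-in for the frozen pre-trained Encoder layer: its parameters
    # are NOT updated during classifier training.  Here it simply emits
    # a fixed 3-dimensional feature vector from token statistics.
    n = len(tokens)
    return [n, sum(len(t) for t in tokens) / max(n, 1), 1.0]

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]

def classify(tokens, weights):
    # Classification (Softmax) layer: only `weights` would be trained.
    feats = encode(tokens)
    logits = [sum(w * f for w, f in zip(row, feats)) for row in weights]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs

# Two-class example with hand-set ("already trained") weights:
# class 0 favours longer questions, class 1 shorter ones.
weights = [[1.0, 0.0, 0.0],
           [-1.0, 0.0, 2.0]]
label, probs = classify("what is my credit card limit now".split(), weights)
```

During training, gradient updates would flow only into `weights`; `encode` stays fixed, which is exactly what lets the classifier be learned from few labeled samples.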
For example, Fig. 3 is a structural diagram of the first training model provided by an embodiment of the present invention. As shown in Fig. 3, the first training model comprises a Word Embedding layer, an Encoder layer, and a Softmax layer. The question in each piece of training sample data of the first training sample — for example: "Hello, may I ask whether you are XXX?" — is input into the Word Embedding layer, which converts each word of the question into a term vector of fixed length and outputs it to the Encoder layer. The Encoder layer processes the term vectors and outputs a state variable C to the Softmax layer, where C serves as the Softmax layer's initial value. After training, the Softmax layer can output the label corresponding to the question; for example, for the input question "Hello, may I ask whether you are XXX?", the corresponding label label_ask_whether_me is output.
S204: obtain a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model; wherein the remaining unlabeled corpus sample is obtained by removing, from the unlabeled corpus sample, the unlabeled corpus data corresponding to the first training sample.
Specifically, after the initial classification model is obtained, it can be used to annotate the remaining unlabeled corpus sample: the question in each piece of corpus data of the remaining unlabeled corpus sample is used as the input of the initial classification model, which yields the label corresponding to that piece of corpus data. Then, from the question in each piece of corpus data of the remaining unlabeled corpus sample and its corresponding label, a supplementary training sample can be obtained; each piece of training sample data in the supplementary training sample comprises one question and a corresponding label. The remaining unlabeled corpus sample is what remains of the unlabeled corpus sample after removing the corpus data corresponding to the pieces of training sample data of the first training sample.
S205: train the initial classification model on a second training sample to obtain the text classification model; wherein the second training sample comprises the first training sample and the supplementary training sample, and the parameters of the encoding layer included in the initial classification model remain unchanged during the training of the initial classification model.
Specifically, the supplementary training sample and the first training sample are combined into a second training sample; each piece of training sample data in the second training sample comprises one question and a corresponding label. The question in each piece of training sample data of the second training sample is used as the input of the initial classification model, and the label in each piece of training sample data is used as its output; with the parameters of the Encoder layer of the initial classification model kept constant during training, the initial classification model is trained to obtain the text classification model.
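Steps S204-S205 amount to one round of self-training (pseudo-labeling). A minimal sketch under stated assumptions — `initial_model` is any callable mapping a question to a label (here a trivial stand-in), and the actual frozen-encoder retraining of S205 is left out:

```python
def build_supplementary_sample(remaining_questions, initial_model):
    # S204: the initial classification model annotates the remaining
    # unlabeled questions, producing pseudo-labeled training data.
    return [(q, initial_model(q)) for q in remaining_questions]

def build_second_training_sample(first_sample, supplementary_sample):
    # S205: second training sample = first (manually labeled) sample
    # plus the pseudo-labeled supplementary sample.
    return first_sample + supplementary_sample

# Illustrative stand-in for the initial classification model.
initial_model = lambda q: "label_query" if "?" in q else "label_other"

first_sample = [("What is the rate?", "label_query")]
remaining = ["What is my limit?", "I want a credit card."]
supplementary = build_supplementary_sample(remaining, initial_model)
second_sample = build_second_training_sample(first_sample, supplementary)
```

The text classification model is then obtained by retraining the classifier (Encoder frozen) on `second_sample`, which contains far more data than the manually annotated first sample alone.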
With the classification method for an unlabeled corpus provided by the embodiments of the present invention, a text classification model can be built without a large number of labeled training samples, which improves the accuracy with which the text classification model is built. Since unlabeled corpus samples are very convenient to obtain, the method can still build a text classification model, and avoid the problem of over-fitting, when high-quality labeled samples are hard to obtain or insufficient in quantity.
Fig. 4 is a flow diagram of a classification method for an unlabeled corpus provided by another embodiment of the present invention. As shown in Fig. 4, further to the above embodiments, sampling the unlabeled corpus sample to obtain the first training sample comprises:
S2021: cluster the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories.
Specifically, the unlabeled corpus sample can be clustered with a clustering algorithm to obtain unlabeled corpus samples of preset categories: each preset category corresponds to a certain quantity of corpus data in the unlabeled corpus sample, and each piece of corpus data corresponds to one of the preset categories. The preset categories are configured based on practical experience and are not limited by the embodiments of the present invention; the clustering algorithm can be LDA (Latent Dirichlet Allocation) or the K-means clustering algorithm. It should be understood that in clustering, the preset categories only fix the number of clusters; the label corresponding to each preset category is not yet known.
S2022: sample the unlabeled corpus sample of each preset category to obtain an original sample.
Specifically, to reduce the quantity of corpus data of the unlabeled corpus sample that must be annotated, the unlabeled corpus sample of each preset category is sampled, i.e., a certain quantity of corpus data is taken from the unlabeled corpus sample of each preset category to obtain the original sample. The ratio or quantity sampled is configured according to actual needs and is not limited by the embodiments of the present invention.
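S2021-S2022 can be sketched as follows. The clustering itself is replaced here by a hypothetical `cluster_id` stand-in (the patent suggests LDA or K-means), and a fixed number of questions is drawn from each cluster:

```python
import random
from collections import defaultdict

def sample_per_cluster(questions, cluster_id, k, seed=0):
    # S2021: group the unlabeled questions by their preset category.
    clusters = defaultdict(list)
    for q in questions:
        clusters[cluster_id(q)].append(q)
    # S2022: sample up to k questions from each category; only this
    # "original sample" is then manually annotated (S2023).
    rng = random.Random(seed)
    original_sample = []
    for members in clusters.values():
        original_sample.extend(rng.sample(members, min(k, len(members))))
    return original_sample

# Illustrative stand-in clustering: bucket questions by first word.
cluster_id = lambda q: q.split()[0]
questions = ["what is the rate", "what is my limit",
             "open an account", "open a card", "what bank is this"]
original = sample_per_cluster(questions, cluster_id, k=1)
```

Sampling per cluster rather than uniformly keeps every preset category represented in the original sample while bounding the annotation workload.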
S2023: obtain the first training sample from the annotated original sample.
Specifically, after the original sample is obtained, each piece of its corpus data can be manually annotated with its preset category according to its semantics, so that the label corresponding to each piece of corpus data of the original sample is obtained, yielding the annotated original sample. The first training sample is then obtained from the question in each annotated piece of corpus data and the label corresponding to that piece; each piece of training sample data of the first training sample comprises one question and a corresponding label.
Fig. 5 is a flow diagram of a classification method for an unlabeled corpus provided by a further embodiment of the present invention. As shown in Fig. 5, further to the above embodiments, obtaining the supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model comprises:
S2041: annotate the remaining unlabeled corpus sample with the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample.
Specifically, after the initial classification model is obtained, the question in each piece of corpus data of the remaining unlabeled corpus sample is used as the input of the initial classification model, which yields the label corresponding to each piece of corpus data of the remaining unlabeled corpus sample.
S2042: obtain the supplementary training sample from the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample and the question that each piece of corpus data comprises.
Specifically, the supplementary training sample is obtained from the question in each piece of corpus data of the remaining unlabeled corpus sample and the label corresponding to that piece; each piece of training sample data of the supplementary training sample comprises one question and a corresponding label. Each piece of corpus data of the remaining unlabeled corpus sample corresponds to one piece of training sample data of the supplementary training sample, and the two comprise the same question.
Further to the above embodiments, the encoder-decoder framework model adopts a sequence-to-sequence model.
Specifically, the encoder-decoder framework model adopts a sequence-to-sequence (Seq2Seq) model, which is applicable to scenarios such as machine translation, text summarization, and dialogue. The embodiments of the present invention obtain the pre-trained model by Seq2Seq training, which can express the relationship between sentences well; the Encoder layer of the pre-trained model can express a sentence well. The Seq2Seq model can realize the Encoder and Decoder layers with the Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) algorithm, or realize the Encoder-Decoder layers with the Transformer algorithm. The GRU algorithm trains faster than the LSTM algorithm, while the Transformer algorithm adds an attention mechanism and models more effectively.
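As an illustration of the recurrent cells mentioned above, a single GRU step can be sketched in pure Python. This uses scalar weights and a 1-dimensional hidden state purely for exposition — a real Encoder would use a deep-learning framework with vector states and learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    # One GRU step with scalar input x, scalar hidden state h,
    # and parameter dict p.  z: update gate, r: reset gate.
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])
    # New state interpolates between the old state and the candidate.
    return (1.0 - z) * h + z * h_tilde

params = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
          "wr": 0.5, "ur": 0.1, "br": 0.0,
          "wh": 1.0, "uh": 1.0, "bh": 0.0}

# Encoding a sequence: the final hidden state plays the role of the
# fixed-length "state variable C" the Encoder passes to the Decoder.
h = 0.0
for x in [0.2, -0.4, 0.9]:
    h = gru_step(x, h, params)
```

The GRU's two gates (versus the LSTM's three) are what make it faster to train, as noted above.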
A Seq2Seq model realized with LSTM is illustrated below. Fig. 6 is a structural diagram of the Seq2Seq model provided by an embodiment of the present invention. As shown in Fig. 6, the Seq2Seq model comprises two Word Embedding layers, an Encoder layer, and a Decoder layer. The Word Embedding layers convert each word of an input sentence into a term vector of fixed length; the Encoder and Decoder layers are both realized with LSTM. The question in each piece of corpus data of the unlabeled corpus sample — for example: "Hello, may I ask whether you are XXX?" — is input into Word Embedding layer 1, which converts each word of the question into a term vector of fixed length and outputs it to the Encoder layer. The Encoder layer processes the input term vectors and outputs a state variable C to the Decoder layer as the Decoder layer's initial value. The answer in each piece of corpus data of the unlabeled corpus sample — for example: "Yes, I am" — is input into Word Embedding layer 2, which converts each word of the answer into a term vector of fixed length and outputs it to the Decoder layer. By learning from the state variable C, the Decoder layer can finally output the answer corresponding to the question. GO and EOS are special characters: GO marks the start of the answer and EOS marks its end.
Fig. 7 is a structural diagram of a system for building a text classification model based on an unlabeled corpus, provided by an embodiment of the present invention. As shown in Fig. 7, the system can be used to build the text classification model and comprises: a dialogue corpus generating device 1, a sample generating device 2, a model training device 3, a sample storage device 4, and a model storage device 5, in which:
The dialogue corpus generating device 1 is used to collect and generate the unlabeled corpus sample. It comprises an offline speech transcription unit 101 and a proofreading unit 102. The offline speech transcription unit 101 transcribes recorded dialogue corpus into text, and the proofreading unit 102 proofreads the transcribed text to generate the unlabeled corpus sample; the proofreading unit 102 can send the generated unlabeled corpus sample to the sample storage device 4 for storage. Proofreading includes, but is not limited to, dialogue alignment and manual error correction. An example of an unlabeled dialogue sample obtained by the dialogue corpus generating device 1:
Question: Hello, may I ask what the bank's housing loan interest rate is? Answer: Hello, the current interest rate is xx.
Question: What is my credit card limit now? Answer: Hello, your current limit is xxx.
The sample generating device 2 is used to generate the first training sample from the unlabeled corpus sample. It comprises a sample clustering unit 201, a sample sampling unit 202, and a sample annotation unit 203. The sample clustering unit 201 clusters the unlabeled corpus sample with a clustering algorithm to obtain unlabeled corpus samples of preset categories; it can fetch the unlabeled corpus sample from the sample storage device 4, and the clustering algorithm can be LDA or K-means. The sample sampling unit 202 samples the unlabeled corpus sample of each preset category to obtain the original sample. The sample annotation unit 203 obtains the first training sample from the annotated original sample; the annotation of the original sample can be performed manually, assigning a label to each preset category. The sample annotation unit 203 can send the first training sample to the sample storage device 4 for storage, in preparation for the subsequent initial model training. An example of the first training sample:
label_query_business  Hello, may I ask what the bank's housing loan interest rate is?
label_query_business  What is my credit card limit now?
label_handle_credit_card  I want to apply for a credit card.
Here, label_query_business and label_handle_credit_card are the labels, configured based on practical experience; the sentence following each label is a question corresponding to that label.
The model training apparatus 3 is used to train the corresponding models according to the training data. The model training apparatus 3 includes a pre-training model unit 301, an initial model training unit 302 and a final model training unit 303. The pre-training model unit 301 may obtain the unlabeled corpus sample from the sample storage device 4, and is used to perform model training with an algorithm of the Encoder-Decoder framework according to the unlabeled corpus sample to obtain a pre-training model. The Encoder-Decoder framework algorithm may be the Seq2Seq algorithm, which is commonly used in scenarios such as machine translation, text summarization and dialogue. The embodiment of the present invention uses the Seq2Seq algorithm as the pre-training algorithm; the pre-training model obtained by training can well express the relationship between sentences, and its Encoder layer can well express a sentence.
The initial model training unit 302 is used to train, according to the first training sample, a first training model that includes the Encoder layer of the pre-training model and a classification layer, to obtain an initial classification model. During training, the parameters of the Encoder layer are kept fixed, and only the parameters of the classification layer are trained. In this way, the relationships between the pieces of corpus data of the unlabeled corpus sample are utilized, and the classifier can be trained in a supervised manner to complete the text classification task. The method can obtain an initial classification model with high generalization from a relatively small number of training samples. After the initial classification model is obtained, the remaining unlabeled corpus samples may be labeled with the initial classification model to finally obtain a supplementary training sample; the first training sample and the supplementary training sample are combined into a second training sample containing more training data than the first training sample. The final model training unit 303 is used to train the initial classification model according to the second training sample to obtain a text classification model. During this training, the parameters of the Encoder layer of the initial classification model are kept fixed, and the parameters of the classification layer of the initial classification model are trained.
The sample storage device 4 is used to store sample data. The sample storage device 4 includes an unlabeled sample storage unit 401 and a sample data storage unit 402. The unlabeled sample storage unit 401 is used to store the unlabeled corpus sample. The sample data storage unit 402 is used to store the first training sample and the second training sample.
The model storage device 5 is used to store models. The model storage device 5 includes a pre-training model storage unit 501, an initial classification model storage unit 502 and a final model storage unit 503. The pre-training model storage unit 501 is used to store the pre-training model, the initial classification model storage unit 502 is used to store the initial classification model, and the final model storage unit 503 is used to store the text classification model.
After the text classification model is applied in practice, more corpus data can be obtained over time, so that the unlabeled corpus sample, the first training sample and the second training sample can all be expanded. The expanded unlabeled corpus sample may be used to retrain the encoder-decoder framework model and update the pre-training model; after the pre-training model is updated, the first training model can be updated correspondingly. The expanded first training sample may be used to retrain the updated first training model and update the initial classification model; the expanded second training sample may be used to retrain the updated initial classification model and update the text classification model. A mechanism of continuous iterative updating is thus established, in which the text classification model is continuously optimized and updated, further improving its accuracy.
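The continuous-update mechanism described above can be outlined as a loop over three training steps. The sketch below uses placeholder function bodies (assumed purely for illustration; they are not the patent's training code) to show the order in which the models refresh each round:

```python
def train_pretrained(unlabeled_corpus):
    # Placeholder for Seq2Seq pre-training on (question, answer) pairs.
    return {"encoder": f"encoder over {len(unlabeled_corpus)} pairs"}

def train_initial(pretrained, first_sample):
    # Placeholder for training the classification layer on the frozen encoder.
    return {"encoder": pretrained["encoder"], "classifier": len(first_sample)}

def train_final(initial_model, second_sample):
    # Placeholder for retraining the classification layer on the enlarged sample.
    return {"encoder": initial_model["encoder"], "classifier": len(second_sample)}

corpus = [("q1", "a1"), ("q2", "a2")]          # unlabeled (question, answer) pairs
first = [("label_a", "q1")]                     # manually labeled first sample
second = first + [("label_b", "q2")]            # first + supplementary sample

for _ in range(2):  # each round: new corpus data arrives, all models refresh
    corpus = corpus + [("new_q", "new_a")]      # expanded unlabeled corpus
    pretrained = train_pretrained(corpus)       # updated pre-training model
    initial_model = train_initial(pretrained, first)
    text_classifier = train_final(initial_model, second)

print(text_classifier["classifier"])  # 2
```
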
Fig. 8 is a structural schematic diagram of the apparatus for classifying an unlabeled corpus provided by one embodiment of the present invention. As shown in Fig. 8, the apparatus includes an acquiring unit 801 and a classification unit 802, in which: the acquiring unit 801 is used to obtain an unlabeled corpus, the unlabeled corpus including at least one question; the classification unit 802 is used to input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question. The text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
Specifically, whether in manual customer service or in conversation with a robot, voice dialogues are generated in the course of serving customers. These voice dialogues can be transcribed into text by speech recognition technology, and the questions in the text can be collected as the unlabeled corpus. The unlabeled corpus is text data that has not been classified; the acquiring unit 801 can obtain the unlabeled corpus, which includes at least one question.
After the unlabeled corpus is obtained, the classification unit 802 takes each question in the unlabeled corpus as the input of the text classification model; through the processing of the text classification model, the label corresponding to each question can be output, the label identifying the type to which the question belongs. The text classification model is obtained after training on an unlabeled corpus sample; every piece of corpus data in the unlabeled corpus sample includes a question and an answer, the answer being a reply to the question, so that the answer can also be regarded as an annotation of the question. The quantity of corpus data included in the unlabeled corpus sample is configured according to actual needs, which the embodiment of the present invention does not limit. The specific training process of the text classification model is described below and is not repeated here.
The apparatus for classifying an unlabeled corpus provided in the embodiment of the present invention can obtain an unlabeled corpus, input each question included in the unlabeled corpus into a text classification model obtained after training on an unlabeled corpus sample, and output the label corresponding to each question. When high-quality labeled samples are difficult to obtain or insufficient in quantity, the unlabeled corpus can still be classified by the text classification model obtained after training on the unlabeled corpus sample, which improves the accuracy of classifying the unlabeled corpus.
Fig. 9 is a structural schematic diagram of the apparatus for classifying an unlabeled corpus provided by another embodiment of the present invention. As shown in Fig. 9, the apparatus further includes a pre-training unit 803, a sample obtaining unit 804, an initial training unit 805, a sample supplementing unit 806 and a model establishing unit 807, in which:
the pre-training unit 803 is used to train an encoder-decoder framework model based on the unlabeled corpus sample to obtain a pre-training model, the pre-training model including a coding layer; the sample obtaining unit 804 is used to perform sampling processing on the unlabeled corpus sample to obtain a first training sample, each piece of training sample data in the first training sample including a question and a corresponding label; the initial training unit 805 is used to train a first training model based on the first training sample to obtain an initial classification model, where the first training model includes a classification layer and the coding layer of the pre-training model, and the parameters of the coding layer remain unchanged during the training of the first training model; the sample supplementing unit 806 is used to obtain a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model, where the remaining unlabeled corpus sample is obtained after removing, from the unlabeled corpus sample, the unlabeled corpus samples corresponding to the first training sample; the model establishing unit 807 is used to train the initial classification model based on a second training sample to obtain the text classification model, where the second training sample includes the first training sample and the supplementary training sample, and the parameters of the coding layer of the initial classification model remain unchanged during the training of the initial classification model.
Specifically, the pre-training unit 803 inputs the unlabeled corpus sample into an encoder-decoder framework model and trains the model to obtain the pre-training model, which includes a coding layer. So-called encoding converts the input sequence into a vector of fixed length; decoding converts that fixed-length vector into the output sequence. In the training process of the encoder-decoder framework model, the question included in each piece of corpus data of the unlabeled corpus sample corresponds to the encoding, and the answer included in each piece of corpus data corresponds to the decoding, so that a semantic encoding with good representation and generalization ability can be obtained. The Encoder-Decoder framework is a model framework in deep learning; models of the Encoder-Decoder framework include but are not limited to sequence-to-sequence models.
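To make the encoding step concrete: an encoder maps a variable-length token sequence to a fixed-length vector. The toy mean-of-embeddings encoder below is a hypothetical stand-in for the learned Encoder layer (the vocabulary and embeddings are invented for illustration; the patent's encoder would be learned by Seq2Seq pre-training):

```python
import random

random.seed(0)

EMBED_DIM = 4
# Hypothetical token embedding table (would be learned during pre-training).
vocab = ["credit", "card", "limit", "loan", "rate", "what", "is", "my"]
embeddings = {tok: [random.gauss(0, 1) for _ in range(EMBED_DIM)] for tok in vocab}

def encode(tokens):
    """Map a variable-length token sequence to a fixed-length vector by
    averaging token embeddings (a toy stand-in for an RNN/Transformer encoder)."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    n = max(len(vecs), 1)
    return [sum(v[i] for v in vecs) / n for i in range(EMBED_DIM)]

v1 = encode(["what", "is", "my", "credit", "card", "limit"])
v2 = encode(["loan", "rate"])
# Both encodings have the same fixed length regardless of input length.
print(len(v1), len(v2))  # 4 4
```
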
After the unlabeled corpus sample is obtained, the sample obtaining unit 804 performs sampling processing on the unlabeled corpus sample to obtain the first training sample. Each piece of training sample data in the first training sample includes a question and a label corresponding to the question; the label identifies the type to which the question belongs, and the label is preset. Each piece of training sample data corresponds to one piece of corpus data in the unlabeled corpus sample, and the question included in the training sample data is identical to the question included in the corresponding corpus data.
Specifically, the initial training unit 805 establishes a first training model, which includes the coding layer of the pre-training model and a classification layer; the output of the Encoder layer serves as the input of the classification layer, and the classification layer may use the Softmax algorithm. The initial training unit 805 takes the question included in each piece of training sample data of the first training sample as the input of the first training model, takes the label included in each piece of training sample data as the output of the first training model, keeps the parameters of the Encoder layer unchanged, and trains the first training model to obtain the initial classification model, which includes the Encoder layer. During the training of the first training model, the parameters of the Encoder layer are kept unchanged and the parameters of the classification layer are obtained by training. In this way, the relationships among the unlabeled corpus samples are utilized, and the classification layer can be trained in a supervised manner to complete the text classification task; an initial classification model with high generalization can be obtained from relatively few training samples.
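The training regime just described — frozen Encoder, trainable Softmax classification layer — can be sketched numerically. This is a hypothetical NumPy illustration in which the frozen encoder is replaced by a fixed random projection and the data are random; it shows the mechanics of updating only the classification layer, not the patent's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: input dim, encoding dim, number of labels, number of samples.
D_IN, D_ENC, N_CLASSES, N = 10, 8, 3, 60

# Frozen "Encoder": a fixed random projection standing in for the pre-trained
# coding layer, scaled for roughly unit-variance encodings. Never updated below.
W_enc = rng.normal(size=(D_IN, D_ENC)) / np.sqrt(D_IN)

X = rng.normal(size=(N, D_IN))          # toy sentence features
y = rng.integers(0, N_CLASSES, size=N)  # toy labels
H = X @ W_enc                           # fixed sentence encodings

# Trainable Softmax classification layer.
W_cls = np.zeros((D_ENC, N_CLASSES))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(H @ W)
    return -np.log(p[np.arange(N), y]).mean()

lr = 0.5
initial = loss(W_cls)
for _ in range(200):
    p = softmax(H @ W_cls)
    p[np.arange(N), y] -= 1.0        # gradient of cross-entropy w.r.t. logits
    W_cls -= lr * (H.T @ p) / N      # only the classification layer is updated
final = loss(W_cls)
print(final < initial)  # True: loss decreases while W_enc stays fixed
```

Because the objective is convex in the classification-layer weights, gradient descent on them alone converges quickly, which mirrors why few labeled samples suffice when the encoder is frozen.
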
After the initial classification model is obtained, the sample supplementing unit 806 can label the remaining unlabeled corpus sample with the initial classification model to obtain the label corresponding to each piece of its corpus data; that is, the question included in each piece of corpus data of the remaining unlabeled corpus sample is taken as the input of the initial classification model, and the label corresponding to each piece of corpus data can thus be obtained. Then, according to the question included in each piece of corpus data of the remaining unlabeled corpus sample and the corresponding label, the sample supplementing unit 806 can obtain the supplementary training sample; each piece of training sample data of the supplementary training sample includes a question and a corresponding label. The remaining unlabeled corpus sample is obtained after removing, from the unlabeled corpus sample, the unlabeled corpus samples corresponding to the first training sample; each piece of training sample data of the first training sample corresponds to one piece of corpus data of the unlabeled corpus sample.
The model establishing unit 807 combines the supplementary training sample and the first training sample into the second training sample; each piece of training sample data of the second training sample includes a question and a corresponding label. The model establishing unit 807 takes the question included in each piece of training sample data of the second training sample as the input of the initial classification model, takes the label included in each piece of training sample data as the output of the initial classification model, keeps the parameters of the Encoder layer of the initial classification model unchanged during training, and trains the initial classification model to obtain the text classification model.
The apparatus for classifying an unlabeled corpus provided in the embodiment of the present invention can establish a text classification model without a large number of labeled training samples, improving the accuracy of establishing the text classification model. Since unlabeled corpus samples are very convenient to obtain, the apparatus can still establish a text classification model and avoid the problem of over-fitting when high-quality labeled samples are difficult to obtain or insufficient in quantity.
Figure 10 is a structural schematic diagram of the apparatus for classifying an unlabeled corpus provided by yet another embodiment of the present invention. As shown in Figure 10, the sample obtaining unit 804 includes a clustering subunit 8041, a sampling subunit 8042 and an obtaining subunit 8043, in which:
the clustering subunit 8041 is used to cluster the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories; the sampling subunit 8042 is used to sample the unlabeled corpus sample of each preset category to obtain an original sample; the obtaining subunit 8043 is used to obtain the first training sample according to the labeled original sample.
Specifically, the clustering subunit 8041 can cluster the unlabeled corpus sample with a clustering algorithm to obtain unlabeled corpus samples of preset categories; that is, each preset category corresponds to a certain quantity of corpus data of the unlabeled corpus sample, and every piece of corpus data of the unlabeled corpus sample corresponds to one preset category. The preset categories are configured based on practical experience, which the embodiment of the present invention does not limit; the clustering algorithm may be LDA (Latent Dirichlet Allocation) or the K-means clustering algorithm. It will be appreciated that the preset categories are the number of classes in the clustering; the label corresponding to each preset category is not yet known.
In order to reduce the quantity of corpus data of the unlabeled corpus sample that must be labeled, the sampling subunit 8042 samples the unlabeled corpus sample of each preset category, i.e., obtains a certain quantity of corpus samples from the unlabeled corpus sample of each preset category, to obtain the original sample. The ratio or quantity of the sampling is configured according to actual needs, which the embodiment of the present invention does not limit.
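The cluster-then-sample step can be sketched with the standard library alone. The clustering itself is stubbed out here as precomputed cluster assignments (in the patent it would come from LDA or K-means), and the cluster contents are invented for illustration:

```python
import random

random.seed(42)

# Hypothetical output of the clustering step: corpus items grouped by
# preset category (cluster id). Labels for the clusters are not yet known.
clusters = {
    0: [f"question about loans #{i}" for i in range(20)],
    1: [f"question about cards #{i}" for i in range(35)],
    2: [f"question about accounts #{i}" for i in range(15)],
}

def sample_per_cluster(clusters, k):
    """Draw up to k items from every cluster, so each preset category
    is represented in the original sample sent for manual labeling."""
    original_sample = {}
    for cid, items in clusters.items():
        original_sample[cid] = random.sample(items, min(k, len(items)))
    return original_sample

original = sample_per_cluster(clusters, k=5)
print({cid: len(items) for cid, items in original.items()})  # {0: 5, 1: 5, 2: 5}
```

Sampling a fixed number per cluster, rather than uniformly from the whole corpus, keeps rare categories represented while capping the manual-labeling effort.
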
After the original sample is obtained, each piece of corpus data of the original sample can be manually labeled with its preset category according to its semantics, so as to obtain the label corresponding to each piece of corpus data, that is, the labeled original sample. The obtaining subunit 8043 obtains the question included in each piece of corpus data of the labeled original sample and the label corresponding to each piece of corpus data, and thus obtains the first training sample; each piece of training sample data of the first training sample includes a question and a corresponding label.
Figure 11 is a structural schematic diagram of the apparatus for classifying an unlabeled corpus provided by a further embodiment of the present invention. As shown in Figure 11, on the basis of the above embodiments, the sample supplementing unit 806 further includes a labeling subunit 8061 and a supplementing subunit 8062, in which:
the labeling subunit 8061 is used to label the remaining unlabeled corpus sample through the initial classification model to obtain the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample; the supplementing subunit 8062 is used to obtain the supplementary training sample according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample and the question included in each piece of corpus data.
Specifically, after the initial classification model is obtained, the labeling subunit 8061 takes the question included in each piece of corpus data of the remaining unlabeled corpus sample as the input of the initial classification model, and can thus obtain the label corresponding to each piece of corpus data of the remaining unlabeled corpus sample.
According to the question included in each piece of corpus data of the remaining unlabeled corpus sample and the label corresponding to each piece of corpus data, the supplementing subunit 8062 can obtain the supplementary training sample; each piece of training sample data of the supplementary training sample includes a question and a corresponding label. Each piece of corpus data of the remaining unlabeled corpus sample corresponds to one piece of training sample data of the supplementary training sample, and the corpus data of the remaining unlabeled corpus sample includes the same question as the corresponding training sample data.
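The supplement step above amounts to pseudo-labeling: run the initial classification model over the remaining unlabeled questions and merge the predictions with the manually labeled first training sample. A minimal sketch with a stub classifier (the keyword rule and all strings below are purely illustrative, standing in for the frozen-encoder + softmax model):

```python
def initial_classifier(question):
    """Stub for the initial classification model: a keyword rule standing in
    for the trained frozen-encoder + softmax classifier (illustrative only)."""
    return "label_card" if "card" in question else "label_other"

first_training_sample = [
    ("label_card", "I want to apply for a credit card."),
    ("label_other", "What is the housing loan rate?"),
]
remaining_unlabeled = [
    "What is my card limit?",
    "How do I open an account?",
]

# Pseudo-label the remaining questions with the initial model.
supplementary_sample = [(initial_classifier(q), q) for q in remaining_unlabeled]

# The second training sample is the union of manual and pseudo-labeled data.
second_training_sample = first_training_sample + supplementary_sample
print(len(second_training_sample))  # 4
```
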
On the basis of the above embodiments, further, the encoder-decoder framework model includes a sequence-to-sequence model.
Specifically, the encoder-decoder framework model uses a sequence-to-sequence model; the Seq2Seq model is applicable to scenarios such as machine translation, text summarization and dialogue. The embodiment of the present invention trains the Seq2Seq model to obtain the pre-training model, which can well express the relationship between sentences, and the Encoder layer of the pre-training model can well express a sentence. The Seq2Seq model can use the LSTM or GRU algorithm to realize the Encoder layer and the Decoder layer, or use the Transformer algorithm to realize the Encoder-Decoder layers. The GRU algorithm trains faster than the LSTM algorithm, and the Transformer algorithm adds an attention mechanism, giving a better modeling effect.
Figure 12 is a schematic diagram of the entity structure of the electronic device provided by one embodiment of the present invention. As shown in Figure 12, the electronic device may include: a processor 1201, a communication interface (Communications Interface) 1202, a memory 1203 and a communication bus 1204, where the processor 1201, the communication interface 1202 and the memory 1203 communicate with each other through the communication bus 1204. The processor 1201 can call the logical instructions in the memory 1203 to execute the following method: obtain an unlabeled corpus, the unlabeled corpus including at least one question; input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; where the text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
In addition, the logical instructions in the above memory 1203 may be realized in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk or an optical disk.
The present embodiment discloses a computer program product; the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer is able to carry out the methods provided by the above method embodiments, for example: obtain an unlabeled corpus, the unlabeled corpus including at least one question; input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; where the text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
The present embodiment provides a computer-readable storage medium storing a computer program; the computer program causes the computer to execute the methods provided by the above method embodiments, for example: obtain an unlabeled corpus, the unlabeled corpus including at least one question; input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; where the text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system or a computer program product. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical memory, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system) and the computer program product according to the embodiments of the present invention. It should be understood that every flow and/or block in the flowchart and/or the block diagram, and combinations of flows and/or blocks in the flowchart and/or the block diagram, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing devices to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing devices generate a device for realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing devices to work in a specific manner, so that the instructions stored in the computer-readable memory generate a manufactured article including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing devices, so that a series of operation steps are executed on the computer or other programmable devices to generate computer-implemented processing; thus the instructions executed on the computer or other programmable devices provide steps for realizing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
In the description of this specification, descriptions with reference to the terms "one embodiment", "specific embodiment", "some embodiments", "such as", "example", "specific example" or "some examples" mean that the particular features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the present invention. In the present specification, the schematic expression of the above terms does not necessarily refer to the same embodiment or example. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The specific embodiments described above further illustrate the purposes, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit the protection scope of the present invention; any modification, equivalent substitution, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Claims (12)
1. A method for classifying an unlabeled corpus, characterized by comprising:
obtaining an unlabeled corpus, the unlabeled corpus including at least one question;
inputting each question included in the unlabeled corpus into a text classification model, and outputting the label corresponding to each question;
wherein the text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
2. The method according to claim 1, characterized in that the step of obtaining the text classification model through training on the unlabeled corpus sample comprises:
training an encoder-decoder framework model based on the unlabeled corpus sample to obtain a pre-training model, the pre-training model including a coding layer;
performing sampling processing on the unlabeled corpus sample to obtain a first training sample, every piece of training sample data in the first training sample including a question and a corresponding label;
training a first training model based on the first training sample to obtain an initial classification model; wherein the first training model includes a classification layer and the coding layer of the pre-training model, and the parameters of the coding layer remain unchanged during the training of the first training model;
obtaining a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model; wherein the remaining unlabeled corpus sample is obtained after removing, from the unlabeled corpus sample, the unlabeled corpus samples corresponding to the first training sample;
training the initial classification model based on a second training sample to obtain the text classification model; wherein the second training sample includes the first training sample and the supplementary training sample, and the parameters of the coding layer of the initial classification model remain unchanged during the training of the initial classification model.
3. The method according to claim 2, characterized in that performing sampling processing on the unlabeled corpus sample to obtain the first training sample comprises:
clustering the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories;
sampling the unlabeled corpus sample of each preset category to obtain an original sample;
obtaining the first training sample according to the labeled original sample.
4. The method according to claim 2, characterized in that obtaining the supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model comprises:
labeling the remaining unlabeled corpus sample through the initial classification model to obtain the label corresponding to every piece of corpus data in the remaining unlabeled corpus sample;
obtaining the supplementary training sample according to the label corresponding to every piece of corpus data in the remaining unlabeled corpus sample and the question included in every piece of corpus data.
5. The method according to any one of claims 1 to 4, characterized in that the encoder-decoder framework model uses a sequence-to-sequence model.
6. An apparatus for classifying an unlabeled corpus, characterized by comprising:
an acquiring unit, used to obtain an unlabeled corpus, the unlabeled corpus including at least one question;
a classification unit, used to input each question included in the unlabeled corpus into a text classification model and output the label corresponding to each question; wherein the text classification model is obtained after training on an unlabeled corpus sample, and every piece of corpus data in the unlabeled corpus sample includes a question and an answer.
7. The apparatus according to claim 6, characterized by further comprising:
a pre-training unit, used to train an encoder-decoder framework model based on the unlabeled corpus sample to obtain a pre-training model, the pre-training model including a coding layer;
a sample obtaining unit, used to perform sampling processing on the unlabeled corpus sample to obtain a first training sample, every piece of training sample data in the first training sample including a question and a corresponding label;
an initial training unit, used to train a first training model based on the first training sample to obtain an initial classification model; wherein the first training model includes a classification layer and the coding layer of the pre-training model, and the parameters of the coding layer remain unchanged during the training of the first training model;
a sample supplementing unit, used to obtain a supplementary training sample based on the remaining unlabeled corpus sample and the initial classification model; wherein the remaining unlabeled corpus sample is obtained after removing, from the unlabeled corpus sample, the unlabeled corpus samples corresponding to the first training sample;
a model establishing unit, used to train the initial classification model based on a second training sample to obtain the text classification model; wherein the second training sample includes the first training sample and the supplementary training sample, and the parameters of the coding layer of the initial classification model remain unchanged during the training of the initial classification model.
8. The device according to claim 7, characterized in that the sample obtaining unit comprises:
a clustering subunit, configured to cluster the unlabeled corpus sample to obtain unlabeled corpus samples of preset categories;
a sampling subunit, configured to sample the unlabeled corpus sample of each preset category to obtain an original sample;
an obtaining subunit, configured to obtain the first training sample from the annotated original sample.
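The cluster-then-sample scheme of claim 8 ensures every preset category is represented in the original sample before annotation. Claim 8 does not name a clustering algorithm, so this sketch substitutes a trivial first-word grouping for a real method such as k-means over sentence embeddings; the per-category draw is the part the claim actually specifies:

```python
import random

def cluster(questions):
    """Stand-in for the clustering subunit: group questions by their first
    word as a trivial proxy for a real clustering algorithm."""
    groups = {}
    for q in questions:
        groups.setdefault(q.split()[0], []).append(q)
    return groups

def sample_per_category(groups, n, seed=0):
    """Sampling subunit: draw up to n questions from each preset category,
    so every cluster contributes to the original sample."""
    rng = random.Random(seed)
    original = []
    for qs in groups.values():
        original.extend(rng.sample(qs, min(n, len(qs))))
    return original

questions = ["how to reset password", "how to change email",
             "where is my order", "where is my refund"]
groups = cluster(questions)
original_sample = sample_per_category(groups, n=1)
# After manual annotation, original_sample becomes the first training sample.
```

Sampling per category rather than uniformly over the whole corpus keeps rare categories from being absent from the first training sample.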
9. The device according to claim 7, characterized in that the sample supplementing unit comprises:
an annotation subunit, configured to annotate the remaining unlabeled corpus sample through the initial classification model, to obtain a label corresponding to each piece of corpus data in the remaining unlabeled corpus sample;
a supplementing subunit, configured to obtain the supplementary training sample according to the label corresponding to each piece of corpus data in the remaining unlabeled corpus sample and the question included in each piece of corpus data.
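Claim 9 is a pseudo-labeling step: the initial model annotates the remaining (question, answer) pairs, and only the questions plus predicted labels are kept as the supplementary sample. A short sketch; the confidence threshold is an assumption on my part (the claim only requires a label per piece of corpus data), and `toy_model` is an illustrative stand-in for the initial classification model:

```python
def annotate_remaining(model, remaining_pairs, threshold=0.5):
    """Annotation subunit: run the initial classification model over the
    remaining unlabeled (question, answer) pairs. Filtering by confidence
    is an assumption, not part of the claim."""
    supplementary = []
    for question, _answer in remaining_pairs:
        label, confidence = model(question)
        if confidence >= threshold:
            # Supplementing subunit: keep only (question, label) pairs;
            # the answer is discarded once pre-training is done.
            supplementary.append((question, label))
    return supplementary

# Illustrative stand-in for the initial classification model.
def toy_model(question):
    return ("account", 0.9) if "password" in question else ("other", 0.3)

remaining = [("how to reset password", "use the link"),
             ("where is my order", "check the orders page")]
supplementary_sample = annotate_remaining(toy_model, remaining)
```

The supplementary sample then joins the first training sample to form the second training sample of claim 7.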
10. The device according to any one of claims 6 to 9, characterized in that the encoder-decoder framework model is a sequence-to-sequence model.
11. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 5.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910602361.6A CN110297909B (en) | 2019-07-05 | 2019-07-05 | Method and device for classifying unlabeled corpora |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110297909A true CN110297909A (en) | 2019-10-01 |
CN110297909B CN110297909B (en) | 2021-07-02 |
Family
ID=68030372
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910602361.6A Active CN110297909B (en) | 2019-07-05 | 2019-07-05 | Method and device for classifying unlabeled corpora |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110297909B (en) |
2019-07-05: application CN201910602361.6A filed; granted as patent CN110297909B (status: active)
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107368182A (en) * | 2016-08-19 | 2017-11-21 | 北京市商汤科技开发有限公司 | Gestures detection network training, gestures detection, gestural control method and device |
CN108319599A (en) * | 2017-01-17 | 2018-07-24 | 华为技术有限公司 | A kind of interactive method and apparatus |
US20180329884A1 (en) * | 2017-05-12 | 2018-11-15 | Rsvp Technologies Inc. | Neural contextual conversation learning |
CN107818080A (en) * | 2017-09-22 | 2018-03-20 | 新译信息科技(北京)有限公司 | Term recognition methods and device |
CN107679734A (en) * | 2017-09-27 | 2018-02-09 | 成都四方伟业软件股份有限公司 | It is a kind of to be used for the method and system without label data classification prediction |
CN108509596A (en) * | 2018-04-02 | 2018-09-07 | 广州市申迪计算机系统有限公司 | File classification method, device, computer equipment and storage medium |
CN109308316A (en) * | 2018-07-25 | 2019-02-05 | 华南理工大学 | A kind of adaptive dialog generation system based on Subject Clustering |
CN109189901A (en) * | 2018-08-09 | 2019-01-11 | 北京中关村科金技术有限公司 | Automatically a kind of method of the new classification of discovery and corresponding corpus in intelligent customer service system |
CN109446302A (en) * | 2018-09-25 | 2019-03-08 | 中国平安人寿保险股份有限公司 | Question and answer data processing method, device and computer equipment based on machine learning |
Non-Patent Citations (2)
Title |
---|
PRAJIT RAMACHANDRAN et al.: "Unsupervised Pretraining for Sequence to Sequence Learning", under review as a conference paper at ICLR 2017 * |
REDDITNOTE: "google开源seq2seq,通用编码器&解码器框架", 《HTTPS://BLOG.CSDN.NET/REDDITNOTE/ARTICLE/DETAILS/102589845》 * |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110888968A (en) * | 2019-10-15 | 2020-03-17 | 浙江省北大信息技术高等研究院 | Customer service dialogue intention classification method and device, electronic equipment and medium |
CN111125365A (en) * | 2019-12-24 | 2020-05-08 | 京东数字科技控股有限公司 | Address data labeling method and device, electronic equipment and storage medium |
CN111198948A (en) * | 2020-01-08 | 2020-05-26 | 深圳前海微众银行股份有限公司 | Text classification correction method, device and equipment and computer readable storage medium |
CN111506732A (en) * | 2020-04-20 | 2020-08-07 | 北京中科凡语科技有限公司 | Text multi-level label classification method |
CN111506732B (en) * | 2020-04-20 | 2023-05-26 | 北京中科凡语科技有限公司 | Text multi-level label classification method |
CN111554270A (en) * | 2020-04-29 | 2020-08-18 | 北京声智科技有限公司 | Training sample screening method and electronic equipment |
CN111554270B (en) * | 2020-04-29 | 2023-04-18 | 北京声智科技有限公司 | Training sample screening method and electronic equipment |
CN111626063A (en) * | 2020-07-28 | 2020-09-04 | 浙江大学 | Text intention identification method and system based on projection gradient descent and label smoothing |
CN111626063B (en) * | 2020-07-28 | 2020-12-08 | 浙江大学 | Text intention identification method and system based on projection gradient descent and label smoothing |
Also Published As
Publication number | Publication date |
---|---|
CN110297909B (en) | 2021-07-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110297909A (en) | Method and device for classifying unlabeled corpora | |
CN110377911B (en) | Method and device for identifying intention under dialog framework | |
US10909328B2 (en) | Sentiment adapted communication | |
CN110990543A (en) | Intelligent conversation generation method and device, computer equipment and computer storage medium | |
CN107680579A (en) | Text regularization model training method and device, text regularization method and device | |
CN107657017A (en) | Method and apparatus for providing voice service | |
CN107886949A (en) | A kind of content recommendation method and device | |
CN113268610B (en) | Intent jump method, device, equipment and storage medium based on knowledge graph | |
CN109408800A (en) | Talk with robot system and associative skills configuration method | |
CN111930914A (en) | Question generation method and device, electronic equipment and computer-readable storage medium | |
CN110704618B (en) | Method and device for determining standard problem corresponding to dialogue data | |
CN113450759A (en) | Voice generation method, device, electronic equipment and storage medium | |
CN110225210A (en) | Based on call abstract Auto-writing work order method and system | |
CN112860871B (en) | Natural language understanding model training method, natural language understanding method and device | |
CN110232914A (en) | A kind of method for recognizing semantics, device and relevant device | |
CN115455982A (en) | Dialogue processing method, dialogue processing device, electronic equipment and storage medium | |
CN115687934A (en) | Intention recognition method and device, computer equipment and storage medium | |
CN111368066B (en) | Method, apparatus and computer readable storage medium for obtaining dialogue abstract | |
CN114003700A (en) | Method and system for processing session information, electronic device and storage medium | |
CN107734123A (en) | A kind of contact sequencing method and device | |
CN111046674B (en) | Semantic understanding method and device, electronic equipment and storage medium | |
CN113486174A (en) | Model training, reading understanding method and device, electronic equipment and storage medium | |
CN110782221A (en) | Intelligent interview evaluation system and method | |
CN115934891A (en) | Question understanding method and device | |
CN114118068B (en) | Method and device for amplifying training text data and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||