Summary of the invention
One or more embodiments of this specification describe a model training method for text analysis, a text classification method, and corresponding apparatuses, which exploit the high running speed of the Transformer model when performing deep modeling of sequence data, while ensuring the robustness of the model.
In a first aspect, a model training method for text analysis is provided, the method comprising:
using a first bidirectional transformer model, obtaining, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence;
using the first bidirectional transformer model, obtaining, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence;
for each position in the first training sentence, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
using a first language model, predicting, for the target word vector of each position in the first training sentence, a first probability of the word at that position; and
training the first bidirectional transformer model and the first language model by minimizing a first loss function related to the first probability, to obtain a trained second bidirectional transformer model and a trained second language model.
In a possible implementation, using the first bidirectional transformer model to obtain, for each word in the first training sentence, the backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence comprises:
using the first bidirectional transformer model with a self-attention mechanism, extracting, for each word in the first training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the first training sentence; and
concatenating the vectors corresponding to the multiple pieces of salient information, to obtain the backward vector for the word.
In a second aspect, a model training method for text analysis is provided, the method comprising:
using the second bidirectional transformer model trained by the method of the first aspect, obtaining, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence;
using the second bidirectional transformer model, obtaining, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence;
for each position in the second training sentence, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
using the second language model trained by the method of the first aspect, predicting, for the target word vector of each position in the second training sentence, a first probability of the word at that position; and generating, from the target word vectors of the positions in the second training sentence, a representation vector of the sentence corresponding to the second training sentence;
using a multi-class classification model, predicting, based on the representation vector of the sentence corresponding to the second training sentence, a second probability of the label corresponding to the second training sentence; and
training the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of a first loss function and a second loss function, to obtain a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability.
In a possible implementation, using the second bidirectional transformer model to obtain, for each word in the second training sentence, the backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence comprises:
using the second bidirectional transformer model with a self-attention mechanism, extracting, for each word in the second training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the second training sentence; and
concatenating the vectors corresponding to the multiple pieces of salient information, to obtain the backward vector for the word.
In a possible implementation, generating, from the target word vectors of the positions in the second training sentence, the representation vector of the sentence corresponding to the second training sentence comprises:
averaging the target word vectors of the positions in the second training sentence, and using the mean as the representation vector of the sentence corresponding to the second training sentence.
In a possible implementation, training the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of the first loss function and the second loss function comprises:
minimizing the sum of the first loss function and the second loss function by gradient descent, to determine the model parameters of the second bidirectional transformer model, the second language model, and the multi-class classification model.
In a third aspect, a text classification method is provided, the method comprising:
using the third bidirectional transformer model trained by the method of the second aspect, obtaining, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified;
using the third bidirectional transformer model, obtaining, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified;
for each position in the sentence to be classified, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
generating, from the target word vectors of the positions in the sentence to be classified, a representation vector of the sentence corresponding to the sentence to be classified; and
using the second multi-class classification model trained by the method of the second aspect, classifying the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified.
In a fourth aspect, a model training apparatus for text analysis is provided, the apparatus comprising:
a forward vector generation unit configured to use a first bidirectional transformer model to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence;
a backward vector generation unit configured to use the first bidirectional transformer model to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence;
a word vector generation unit configured to concatenate, for each position in the first training sentence, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a prediction unit configured to use a first language model to predict, for the target word vector of each position in the first training sentence obtained by the word vector generation unit, a first probability of the word at that position; and
a model training unit configured to train the first bidirectional transformer model and the first language model by minimizing a first loss function related to the first probability obtained by the prediction unit, to obtain a trained second bidirectional transformer model and a trained second language model.
In a fifth aspect, a model training apparatus for text analysis is provided, the apparatus comprising:
a forward vector generation unit configured to use the second bidirectional transformer model trained by the method of the first aspect to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence;
a backward vector generation unit configured to use the second bidirectional transformer model to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence;
a word vector generation unit configured to concatenate, for each position in the second training sentence, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a first prediction unit configured to use the second language model trained by the method of the first aspect to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position;
a sentence vector generation unit configured to generate, from the target word vectors of the positions in the second training sentence obtained by the word vector generation unit, a representation vector of the sentence corresponding to the second training sentence;
a second prediction unit configured to use a multi-class classification model to predict, based on the representation vector of the sentence corresponding to the second training sentence obtained by the sentence vector generation unit, a second probability of the label corresponding to the second training sentence; and
a model training unit configured to train the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of a first loss function and a second loss function, to obtain a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability.
In a sixth aspect, a text classification apparatus is provided, the apparatus comprising:
a forward vector generation unit configured to use the third bidirectional transformer model trained by the method of the second aspect to obtain, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified;
a backward vector generation unit configured to use the third bidirectional transformer model to obtain, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified;
a word vector generation unit configured to concatenate, for each position in the sentence to be classified, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a sentence vector generation unit configured to generate, from the target word vectors of the positions in the sentence to be classified obtained by the word vector generation unit, a representation vector of the sentence corresponding to the sentence to be classified; and
a text classification unit configured to use the second multi-class classification model trained by the method of the second aspect to classify the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified obtained by the sentence vector generation unit.
In a seventh aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method of the first, second, or third aspect.
In an eighth aspect, a computing device is provided, comprising a memory and a processor, the memory storing executable code, and the processor, when executing the executable code, implementing the method of the first, second, or third aspect.
With the methods and apparatuses provided by the embodiments of this specification, in one aspect, a first bidirectional transformer model is first used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence; the first bidirectional transformer model is then used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence; next, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a first language model is then used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position; finally, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. Unlike a common unidirectional Transformer model, the bidirectional transformer model in the embodiments of this specification fully considers the bidirectional context of each word rather than only the preceding context, so that when performing deep modeling of sequence data it can exploit the high running speed of the Transformer model while ensuring the robustness of the model.
In another aspect, the second bidirectional transformer model trained by the method of the first aspect is first used to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence; the second bidirectional transformer model is then used to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence; next, for each position in the second training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; the second language model trained by the method of the first aspect is then used to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position, and a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of the positions in the second training sentence; a multi-class classification model is then used to predict, based on the representation vector of the sentence corresponding to the second training sentence, a second probability of the label corresponding to the second training sentence; finally, the second bidirectional transformer model, the second language model, and the multi-class classification model are trained by minimizing the sum of a first loss function and a second loss function, yielding a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability. In the embodiments of this specification, not only can the high running speed of the Transformer model be exploited, with the robustness of the model ensured, when performing deep modeling of sequence data; moreover, on the basis of the training of the bidirectional transformer model and the language model, the bidirectional transformer model, the language model, and the multi-class classification model are further jointly trained, achieving a better model training effect.
In yet another aspect, the third bidirectional transformer model trained by the method of the second aspect is first used to obtain, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified; the third bidirectional transformer model is then used to obtain, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified; next, for each position in the sentence to be classified, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a representation vector of the sentence corresponding to the sentence to be classified is then generated from the target word vectors of the positions in the sentence to be classified; finally, the second multi-class classification model trained by the method of the second aspect is used to classify the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified. In the embodiments of this specification, the high running speed of the Transformer model can be exploited, and the robustness of the model ensured, when performing deep modeling of sequence data; furthermore, the bidirectional transformer model and the multi-class classification model have undergone two stages of training, which is conducive to obtaining better text classification results.
Specific embodiment
The solutions provided by this specification are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves text classification and the training of models for text analysis. Referring to Fig. 1, the scenario involves three kinds of models: a bidirectional transformer model (also called a two-way Transformer model), a language model, and a multi-class classification model (also called a multi-classifier). When the models are trained, multiple models can be jointly trained to form multi-task learning.
Text classification: text classification refers to the task of classifying a text input by a user into one or more of several predefined classes.
Language model: a language model judges whether a sentence is correct natural language by computing the probability of the sentence occurring in natural language, and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. A neural language model is a kind of language model that models the probability of each sentence with a neural network. By learning from a large corpus, a neural language model can learn the inherent rules and knowledge of a language.
Multi-task learning: multi-task learning is a field of machine learning research that aims to jointly learn multiple related tasks in the same model or framework, so that knowledge transfers between the tasks and the performance of each task improves.
As shown in Fig. 1, model training is divided into two stages: a pre-training stage and a fine-tuning stage.
In the pre-training stage, for a sentence S = {w1, w2, …, wN} consisting of N words, the two-way Transformer model first converts S into N vectors {v1, v2, …, vN}, where each vector is the output vector of one word and is produced with full consideration of the bidirectional context of that word. The output vector vi of each word is then used by the language model to predict the word wi at the current position, so that the two-way Transformer model and the language model are trained based on the prediction results.
In the fine-tuning stage, labeled text data is used. The two-way Transformer model likewise converts the sentence S = {w1, w2, …, wN} into vectors {v1, v2, …, vN}. The mean v̄ of the output vectors {v1, v2, …, vN} of all words is then taken as the representation vector of the sentence, and this representation vector is fed into the multi-classifier for text classification. At the same time, the output vector of each word is fed into the language model to predict the current word, so that in the fine-tuning stage the classification task and the language model prediction task form multi-task learning, which can improve the generalization ability of the multi-class classification model.
In the prediction stage, for a sentence input by a user, the representation vector of the sentence is obtained by averaging the output vectors of the two-way Transformer model, and the representation vector of the sentence is fed into the multi-classifier for classification.
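As a concrete reading of this prediction path, the sketch below averages the per-word output vectors into the sentence representation and feeds it to a linear softmax multi-classifier. The linear-plus-softmax form and the parameter names `W` and `b` are illustrative assumptions; the text does not fix the multi-classifier's internal form.

```python
import numpy as np

def classify_sentence(output_vectors, W, b):
    """output_vectors: (N, d) per-word outputs of the two-way Transformer;
    W: (d, C) and b: (C,) are assumed multi-classifier parameters.
    Returns the predicted class index and the class probabilities."""
    s = output_vectors.mean(axis=0)     # sentence representation: mean of word vectors
    logits = s @ W + b                  # assumed linear multi-classifier
    e = np.exp(logits - logits.max())   # numerically stable softmax
    probs = e / e.sum()
    return int(probs.argmax()), probs
```

The mean pooling makes the sentence vector independent of sentence length, which is why the same classifier can be applied to sentences of any N.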
In the embodiments of this specification, the two-way Transformer model and the language model are first pre-trained; the two-way Transformer model fully considers the bidirectional context of each word, rather than only the preceding context. The pre-trained Transformer model is then fine-tuned on the text classification task to improve the robustness of the model.
Fig. 2 shows a flowchart of a model training method for text analysis according to one embodiment, which may correspond to the pre-training stage mentioned in the application scenario shown in Fig. 1. As shown in Fig. 2, the model training method for text analysis in this embodiment comprises the following steps.
First, in step 21, a first bidirectional transformer model is used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence. It can be understood that the process of obtaining the forward vector of each word in step 21 is similar to that of a unidirectional Transformer model.
Then, in step 22, the first bidirectional transformer model is used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence. It can be understood that the following context of each word is used in the process of obtaining the backward vector of that word in step 22.
In one example, the first bidirectional transformer model is used with a self-attention mechanism to extract, for each word in the first training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the first training sentence; the vectors corresponding to the multiple pieces of salient information are then concatenated to obtain the backward vector of the word.
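A minimal sketch of this example follows, under the assumption that the "different perspectives" are attention heads and that the backward direction is realized by letting each position attend only to itself and the following positions. The single unscaled projection per head is a simplification for illustration, not the exact network of the embodiment.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def backward_vectors(x, heads):
    """x: (n, d) initial word vectors; heads: list of (Wq, Wk, Wv)
    projection triples, one per attention 'perspective'. Each head lets
    position i attend only to positions j >= i (the word and its
    following context); the per-head outputs are concatenated."""
    n = x.shape[0]
    allow = np.triu(np.ones((n, n), dtype=bool))   # position i may see j >= i only
    outs = []
    for Wq, Wk, Wv in heads:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        scores = np.where(allow, scores, -np.inf)  # mask out preceding positions
        outs.append(softmax(scores) @ v)
    return np.concatenate(outs, axis=-1)           # splice the salient information
```

Note that the last position can attend only to itself, so its backward vector reduces to its own value projections, which matches the intuition that the last word has no following context.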
Then, in step 23, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position. It can be understood that the target word vector of each position embodies both the context before the position and the context after it, giving good robustness.
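The splicing of step 23 can be written down directly. The text does not specify what stands in at the two sentence boundaries (the first position has no preceding word, the last has no following word), so the zero padding below is an assumption.

```python
import numpy as np

def target_word_vectors(forward, backward):
    """forward, backward: (N, d) forward/backward vectors per word.
    The target vector of position i concatenates the forward vector of
    the previous word with the backward vector of the next word;
    boundary positions are zero-padded (an assumption)."""
    n, d = forward.shape
    pad = np.zeros(d)
    out = []
    for i in range(n):
        prev_fwd = forward[i - 1] if i > 0 else pad
        next_bwd = backward[i + 1] if i < n - 1 else pad
        out.append(np.concatenate([prev_fwd, next_bwd]))
    return np.stack(out)  # (N, 2d)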
Next, in step 24, a first language model is used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position. It can be understood that the bidirectional transformer model is used to obtain the target word vector of each position in the first training sentence, and the language model then predicts the probability of the word at that position from the target word vector.
Finally, in step 25, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. It can be understood that the training sentences used for this training require no manual labeling, which makes it convenient to train the models on a large unlabeled corpus.
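A common realization of a first loss function "related to the first probability" is the negative log-likelihood of the true word at each position. This concrete choice is an assumption for illustration; the text only requires that the loss depend on the first probability.

```python
import numpy as np

def first_loss(logits, word_ids):
    """logits: (N, V) language-model scores over a V-word vocabulary for
    each position; word_ids: (N,) the true word at each position.
    Returns the mean negative log of the first probability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)   # first probability per candidate word
    return float(-np.mean(np.log(probs[np.arange(len(word_ids)), word_ids])))
```

With uniform logits the loss equals log V, and it falls toward zero as the model concentrates probability on the correct words, which is exactly the direction minimization pushes the transformer and language model parameters.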
With the method provided by this embodiment of this specification, a first bidirectional transformer model is first used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence; the first bidirectional transformer model is then used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence; next, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a first language model is then used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position; finally, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. In this embodiment of this specification, unlike a common unidirectional Transformer model, the bidirectional transformer model fully considers the bidirectional context of each word rather than only the preceding context, so that when performing deep modeling of sequence data it can exploit the high running speed of the Transformer model while ensuring the robustness of the model.
Fig. 3 shows a flowchart of a model training method for text analysis according to another embodiment, which may correspond to the fine-tuning stage mentioned in the application scenario shown in Fig. 1. As shown in Fig. 3, the model training method in this embodiment comprises the following steps.
First, in step 31, the second bidirectional transformer model trained by the method described in Fig. 2 is used to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence. It can be understood that the process of obtaining the forward vector of each word in step 31 is similar to that of a unidirectional Transformer model.
Then, in step 32, the second bidirectional transformer model is used to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence. It can be understood that the following context of each word is used in the process of obtaining the backward vector of that word in step 32.
In one example, the second bidirectional transformer model is used with a self-attention mechanism to extract, for each word in the second training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the second training sentence; the vectors corresponding to the multiple pieces of salient information are then concatenated to obtain the backward vector of the word.
Then, in step 33, for each position in the second training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position. It can be understood that the target word vector of each position embodies both the context before the position and the context after it, giving good robustness.
Next, in step 34, the second language model trained by the method described in Fig. 2 is used to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position. It can be understood that the bidirectional transformer model is used to obtain the target word vector of each position in the second training sentence, and the language model then predicts the probability of the word at that position from the target word vector.
Next, in step 35, a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of the positions in the second training sentence. It can be understood that generating the representation vector of the sentence combines the target word vectors of multiple positions, rather than just the target word vector of a single position.
In one example, the mean of the target word vectors of the positions in the second training sentence is taken, and this mean serves as the representation vector of the sentence corresponding to the second training sentence.
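The mean in this example is an element-wise average of the per-position target word vectors; a minimal sketch:

```python
def sentence_vector(vectors):
    # Element-wise mean of the target word vectors of all positions,
    # used as the representation vector of the whole sentence.
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(dim)]
```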
Then, in step 36, using the multi-classification model, a second probability of the label corresponding to the second training sentence is predicted based on the representation vector of the sentence. It can be understood that the label is a pre-annotated text classification category.
Finally, in step 37, by minimizing the sum of the first loss function and the second loss function, the second bidirectional transducer model, the second language model, and the multi-classification model are trained to obtain a third bidirectional transducer model, a third language model, and a second multi-classification model; the first loss function is related to the first probability, and the second loss function is related to the second probability. It can be understood that the training sentences used at this stage must be labeled manually, so that the annotated corpus can be used to further train the model.
In one example, the sum of the first loss function and the second loss function is minimized by gradient descent, so as to determine the model parameters of the second bidirectional transducer model, the second language model, and the multi-classification model.
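The idea of minimizing the sum of two loss functions by gradient descent can be illustrated on a toy one-parameter problem; the quadratic losses here are illustrative stand-ins, not the model's actual losses.

```python
def grad_sum(w):
    # Gradient of L = L1 + L2 with illustrative losses
    # L1 = (w - 1)^2 and L2 = (w - 3)^2:
    # d/dw [(w - 1)^2 + (w - 3)^2] = 2(w - 1) + 2(w - 3)
    return 2 * (w - 1) + 2 * (w - 3)

w = 0.0
for _ in range(200):
    w -= 0.1 * grad_sum(w)  # the step size plays the role of gamma
# w converges to 2.0, the minimizer of the SUM, trading off both losses
```

Minimizing the summed loss thus settles on a compromise between the two objectives, which is the point of the joint training step.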
In the method provided by this embodiment of the specification, the second bidirectional transducer model trained by the method of Fig. 2 is first used to obtain, for each word in the second training sentence, the forward vector of the word, based on the initial word vector of the word and the information preceding the word in the second training sentence; the same model is then used to obtain the backward vector of each word, based on the initial word vector of the word and the information following the word. Next, according to the position of each word in the second training sentence, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector of the position. The second language model trained by the method of Fig. 2 then predicts, for the target word vector of each position, the first probability of the corresponding word, and a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of all positions. The multi-classification model then predicts, based on this representation vector, the second probability of the label corresponding to the second training sentence. Finally, by minimizing the sum of the first loss function and the second loss function, the second bidirectional transducer model, the second language model, and the multi-classification model are trained to obtain the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function to the second probability. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data, while the robustness of the model is guaranteed; moreover, on the basis of the training of the bidirectional transducer model and the language model, joint training of the bidirectional transducer model, the language model, and the multi-classification model is further performed, achieving a better model training result.
Fig. 4 shows a flowchart of a text classification method according to another embodiment; this method corresponds to the prediction stage of the application scenario shown in Fig. 1. As shown in Fig. 4, the text classification method of this embodiment includes the following steps.
First, in step 41, using the third bidirectional transducer model trained by the method of Fig. 3, for each word in a sentence to be classified, the forward vector corresponding to the word is obtained based on the initial word vector of the word and the information preceding the word in the sentence to be classified. It can be understood that in step 41 the process of obtaining the forward vector of each word is similar to that of a unidirectional Transformer model.
Then, in step 42, using the third bidirectional transducer model, for each word in the sentence to be classified, the backward vector corresponding to the word is obtained based on the initial word vector of the word and the information following the word in the sentence to be classified. It can be understood that in step 42 the information following each word is used in obtaining its backward vector.
Then, in step 43, according to the position of each word in the sentence to be classified, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector corresponding to the position. It can be understood that the target word vector for each position reflects both the text preceding the position and the text following it, which gives the model good robustness.
Then, in step 44, a representation vector of the sentence corresponding to the sentence to be classified is generated according to the target word vectors of the positions in the sentence to be classified. It can be understood that this representation vector combines the target word vectors of multiple positions, rather than the target word vector of a single position.
Finally, in step 45, using the second multi-classification model trained by the method of Fig. 3, text classification is performed on the sentence to be classified based on its representation vector. Specifically, the multi-classification model predicts, from the representation vector, the probability that the sentence to be classified belongs to each category, and the category with the highest probability is taken as the classification result.
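The final selection in step 45 is an argmax over the predicted category probabilities; a minimal sketch, with hypothetical category names:

```python
def classify(prob_by_category):
    # Return the category with the highest predicted probability.
    return max(prob_by_category, key=prob_by_category.get)
```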
In the method provided by this embodiment of the specification, the third bidirectional transducer model trained by the method of Fig. 3 is first used to obtain, for each word in the sentence to be classified, the forward vector of the word, based on the initial word vector of the word and the information preceding it in the sentence to be classified; the same model is then used to obtain the backward vector of each word, based on the initial word vector of the word and the information following it. Next, according to the position of each word, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector of the position, and a representation vector of the sentence is generated from the target word vectors of all positions. Finally, using the second multi-classification model trained by the method of Fig. 3, text classification is performed on the sentence to be classified based on its representation vector. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed, and the bidirectional transducer model and the multi-classification model obtained through two-stage training help achieve better text classification results.
The three models involved in the foregoing embodiments are described in detail below: the bidirectional Transformer model (bidirectional Transformer for short), the language model, and the multi-classification model (also referred to as the multi-classifier).
First, the bidirectional Transformer.
The working principle of the conventional Transformer (i.e., the unidirectional Transformer) is introduced first and then extended to the bidirectional Transformer.
1. Unidirectional Transformer
The Transformer was proposed by Ashish Vaswani et al. of Google for converting text sequences into vectors. The Transformer overcomes the drawback of an LSTM, which must process text word by word: each word obtains its preceding information from the text before it through an attention mechanism, so the output vectors of all words can be computed in parallel.
Fig. 5 is a schematic diagram of the internal structure of the unidirectional Transformer model provided by this embodiment of the specification. Referring to Fig. 5, the input of the Transformer module is a vector sequence X = {x_1, x_2, ..., x_N}, where x_i is the representation vector of the i-th position. X first passes through a multi-head self-attention module, which makes each word interact with every word preceding it, so that each word is augmented with the important information associated with its preceding text. The multi-head self-attention module is composed of multiple self-attention modules of identical structure, each of which is computed as follows.
First, a fully connected (feed-forward) layer converts each word x_i into two vectors k_i and t_i:
k_i = tanh(W_q x_i + b)
t_i = tanh(W_v x_i + b)
where W_q and W_v are trainable parameters of the model. k_i is used to compute the importance of the preceding words {x_1, ..., x_{i-1}} for x_i, and t_i stores the information of x_i to be supplied to the other words. A weighted sum of the t_j of the preceding words, with weights derived from k_i, yields the vector c_i, which represents the important information extracted from the preceding text that is useful for x_i. The multi-head self-attention mechanism uses multiple such attention modules to extract important information for each word x_i from its preceding text {x_1, ..., x_{i-1}} from different perspectives. Finally, the vectors extracted by all the attention modules for each word are spliced into d_i, which is the output vector of the multi-head self-attention module for that word.
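The k_i / t_i construction above can be sketched as follows. The excerpt does not reproduce the exact weighting formula, so a softmax over k_i · t_j dot products is assumed here; this is an illustrative sketch, not the claimed implementation.

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def matvec(W, x, b):
    # Affine map: one feed-forward layer before the tanh nonlinearity.
    return [sum(w * xv for w, xv in zip(row, x)) + bi for row, bi in zip(W, b)]

def attention_head(xs, Wq, Wv, b):
    # One self-attention head over the PRECEDING words only, as in the
    # unidirectional Transformer described in the text.
    ks = [tanh_vec(matvec(Wq, x, b)) for x in xs]  # k_i: importance queries
    ts = [tanh_vec(matvec(Wv, x, b)) for x in xs]  # t_i: stored information
    cs = []
    for i in range(len(xs)):
        if i == 0:
            cs.append([0.0] * len(ts[0]))  # no preceding text for word 0
            continue
        # Assumed weighting: softmax over dot products of k_i with each t_j
        scores = [sum(k * t for k, t in zip(ks[i], ts[j])) for j in range(i)]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        # c_i: weighted sum of the stored information of the preceding words
        cs.append([sum(w * ts[j][d] for j, w in enumerate(ws)) / z
                   for d in range(len(ts[0]))])
    return cs
```

Because c_i depends only on x_1, ..., x_i, every position can be computed in parallel, which is the parallelism advantage over the LSTM noted above.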
The output vector d_i of each word then passes through a normalization layer (layer normalization), a fully connected (feed-forward) layer, and another normalization layer to yield the output vector l_i, which is the vector of x_i after conversion by the Transformer module. The computation is:
l_i = LayerNorm(LayerNorm(x_i + d_i) + W · LayerNorm(x_i + d_i))
where W is a trainable parameter and LayerNorm normalizes one layer of the neural network so that the flow of information between layers is more stable. LayerNorm is computed as
LayerNorm(x) = (x − μ) / σ
where μ is the mean of all neurons in the layer and σ is the standard deviation of all neurons in the layer.
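The LayerNorm operation, as described by the μ and σ above, can be sketched as follows; no learned gain or bias appears in this excerpt, and the small eps is added only for numerical safety.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a layer's activations to zero mean and unit variance,
    # using the mean mu and standard deviation sigma of the layer.
    mu = sum(v) / len(v)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))
    return [(x - mu) / (sigma + eps) for x in v]
```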
Transformer modules can be stacked, with the output of one Transformer layer serving as the input of the next, to form a multi-layer Transformer network. The computation of the Transformer network can be expressed as
H^(m) = Transformer(H^(m−1)), with H^(0) = X
where H^(m) denotes the output of the m-th Transformer layer.
2. Bidirectional Transformer
The bidirectional Transformer is an extension of the unidirectional Transformer. The unidirectional Transformer considers only the preceding information in its attention mechanism and ignores the following information, yet the text following each word is also useful to it. The bidirectional Transformer therefore models the sentence in both directions, preceding and following, which increases the expressive power of the model. Its computation proceeds as in the unidirectional case, but in both directions:
d_i^→ and d_i^← respectively denote the important information extracted for x_i from the preceding and the following text. After d_i^→ and d_i^← are obtained, each passes through the normalization and fully connected layers, as in the unidirectional Transformer, to yield the two output vectors of the bidirectional Transformer for x_i. After the multi-layer bidirectional Transformer, a sentence S = {w_1, w_2, ..., w_N} is finally converted into two groups of vectors, {l_1^→, ..., l_N^→} and {l_1^←, ..., l_N^←}, and the computation of the bidirectional Transformer network can likewise be expressed as layer-by-layer application of the bidirectional Transformer module.
Next, the language model.
A sentence S = {w_1, w_2, ..., w_N} is converted by the multi-layer bidirectional Transformer into two groups of vectors {l_1^→, ..., l_N^→} and {l_1^←, ..., l_N^←}. The model can be pre-trained on the language-model task: since this task requires no annotation, massive amounts of data are readily available for thorough pre-training. The goal of the language-model task is to predict each word x_i from its context {x_1, ..., x_{i-1}, x_{i+1}, ..., x_N}, so that the model learns the inherent regularities of natural language; if a model can correctly predict every word from its context, it has learned these regularities well. The bidirectional Transformer language model is computed as follows.
First, the forward vector of the (i−1)-th word and the backward vector of the (i+1)-th word are spliced together:
v_i = [l_{i−1}^→ ; l_{i+1}^←]
Then v_i is used to predict the probability of the i-th word w_i:
p(w_i | context) = softmax(W^LM v_i)_{w_i}
where W^LM is a trainable parameter of the language model and W_j^LM denotes the j-th row of W^LM.
The loss function of the language model is the mean of the cross-entropy loss over all words in the sentence:
L_LM = −(1/N) Σ_{i=1..N} log p(w_i | context)
The goal of the language model is to minimize L_LM. Let W denote the set of all trainable parameters in the bidirectional Transformer and W^LM the set of all trainable parameters in the language model. W and W^LM are optimized iteratively by gradient descent:
W ← W − γ_1 ∂L_LM/∂W, W^LM ← W^LM − γ_1 ∂L_LM/∂W^LM
During pre-training, the model is optimized iteratively until L_LM falls below a preset threshold β (typically 0.1, 0.01, or the like), at which point the model is trained and has learned the inherent regularities of natural language. γ_1 is typically a real number on the order of 0.0001.
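The cross-entropy language-model loss can be sketched as follows; the matrix shape assumed here (W_lm as a |V| × d table of row vectors, one row per vocabulary word) is for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def lm_loss(W_lm, vs, word_ids):
    # Mean cross-entropy of predicting each word w_i from its spliced
    # context vector v_i: L_LM = -(1/N) * sum_i log p(w_i | context).
    total = 0.0
    for v, wid in zip(vs, word_ids):
        logits = [sum(w * x for w, x in zip(row, v)) for row in W_lm]
        total -= math.log(softmax(logits)[wid])
    return total / len(vs)
```

With uniform logits over a 2-word vocabulary, the loss is log 2 per word, which is a useful sanity check when implementing the pre-training objective.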
Finally, the multi-classifier.
In the fine-tuning stage, the pre-trained bidirectional Transformer model is fine-tuned on data with true labels. Because the bidirectional Transformer has already grasped knowledge of natural language through pre-training, fine-tuning the pre-trained model together with the classifier achieves a better result than training a randomly initialized model directly on the labeled data.
Fine-tuning comprises two parts. One part is the language-model part, identical to the pre-training process; the other part classifies each input sentence (S = {w_1, w_2, ..., w_N}, l), where l is the label of the sentence. The classification process is as follows.
First, the bidirectional Transformer converts S = {w_1, w_2, ..., w_N} into vectors [v_1, ..., v_N], where each vector is the representation vector of the word at the corresponding position. The mean of the representation vectors of all the words, denoted v̄, is then taken as the representation vector of the entire sentence.
A softmax classifier is then applied to v̄ to compute the probability that S belongs to each label:
p_C(l_k | S) = softmax(W^c v̄)_k
where W^c is the set of trainable parameters of the multi-classifier and W_k^c denotes the k-th row of W^c. The loss function of the classifier is the cross-entropy of each sample with respect to its true label l, i.e.
L_C = −log p_C(l | S)
In the fine-tuning process, the goal of the model is to minimize the sum of the language-model loss L_LM and the classifier loss L_C, L = L_LM + L_C; all parameters of the model are optimized iteratively by gradient descent. The value of γ_2 is usually one order of magnitude smaller than γ_1, around 0.00001.
In the prediction stage, the bidirectional Transformer converts the sentence S into the vector v̄, the multi-classifier computes the probability that S belongs to each label l_k, and the label with the highest probability is output:
l = argmax_k p_C(l_k | S)
Thus, the bidirectional Transformer is used to improve the effectiveness of the text classification model.
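The prediction step, a softmax over the sentence's mean representation followed by an argmax, can be sketched as follows; the label names are hypothetical.

```python
import math

def classify_sentence(W_c, v_bar, labels):
    # p_C(l_k | S) = softmax(W_c . v_bar); return the most probable label.
    logits = [sum(w * x for w, x in zip(row, v_bar)) for row in W_c]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return labels[probs.index(max(probs))]
```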
It should be noted that the multi-task classifier and the task discriminator are not limited to softmax classifiers; any model capable of classification, such as a support vector machine, logistic regression, or a multi-layer neural network, can serve as the multi-task classifier or task discriminator.
According to an embodiment of another aspect, a model training apparatus for text analysis is also provided. The apparatus is configured to execute the model training method for text analysis provided by the embodiments of this specification, for example, the model training method for text analysis shown in Fig. 2. Fig. 6 shows a schematic block diagram of the model training apparatus for text analysis according to one embodiment. As shown in Fig. 6, the apparatus 600 includes:
a forward vector generation unit 61, configured to use the first bidirectional transducer model to obtain, for each word in the first training sentence, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the first training sentence;
a backward vector generation unit 62, configured to use the first bidirectional transducer model to obtain, for each word in the first training sentence, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the first training sentence;
a word vector generation unit 63, configured to splice together, according to the position of each word in the first training sentence, the forward vector of the word preceding the position obtained by the forward vector generation unit 61 and the backward vector of the word following the position obtained by the backward vector generation unit 62, as the target word vector corresponding to the position;
a prediction unit 64, configured to use the first language model to predict, for the target word vector of each position in the first training sentence obtained by the word vector generation unit 63, the first probability of the word corresponding to the position; and
a model training unit 65, configured to train the first bidirectional transducer model and the first language model by minimizing the first loss function related to the first probability obtained by the prediction unit 64, to obtain the trained second bidirectional transducer model and second language model.
Optionally, in one embodiment, the backward vector generation unit 62 is specifically configured to:
use the first bidirectional transducer model to apply, for each word in the first training sentence, a self-attention mechanism that extracts multiple pieces of important information from different perspectives, based on the initial word vector of the word and the information following the word in the first training sentence; and
splice the vectors corresponding to the pieces of important information together to obtain the backward vector corresponding to the word.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 61 first uses the first bidirectional transducer model to obtain, for each word in the first training sentence, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 62 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 63 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the prediction unit 64 then uses the first language model to predict, from the target word vector of each position, the first probability of the corresponding word; and finally the model training unit 65 trains the first bidirectional transducer model and the first language model by minimizing the first loss function related to the first probability, obtaining the trained second bidirectional transducer model and second language model. In this embodiment of the specification, unlike a common unidirectional Transformer model, the bidirectional transducer model fully considers the context of each word rather than only the preceding information, so that when performing deep modeling of sequence data, the high running speed of the Transformer model is exploited while the robustness of the model is guaranteed.
According to an embodiment of another aspect, a model training apparatus for text analysis is also provided. The apparatus is configured to execute the model training method for text analysis provided by the embodiments of this specification, for example, the model training method for text analysis shown in Fig. 3. Fig. 7 shows a schematic block diagram of the model training apparatus for text analysis according to one embodiment. As shown in Fig. 7, the apparatus 700 includes:
a forward vector generation unit 71, configured to use the second bidirectional transducer model trained by the method of Fig. 2 to obtain, for each word in the second training sentence, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the second training sentence;
a backward vector generation unit 72, configured to use the second bidirectional transducer model to obtain, for each word in the second training sentence, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the second training sentence;
a word vector generation unit 73, configured to splice together, according to the position of each word in the second training sentence, the forward vector of the word preceding the position obtained by the forward vector generation unit 71 and the backward vector of the word following the position obtained by the backward vector generation unit 72, as the target word vector corresponding to the position;
a first prediction unit 74, configured to use the second language model trained by the method of Fig. 2 to predict, for the target word vector of each position in the second training sentence, the first probability of the word corresponding to the position;
a sentence vector generation unit 75, configured to generate the representation vector of the sentence corresponding to the second training sentence according to the target word vectors of the positions in the second training sentence obtained by the word vector generation unit 73;
a second prediction unit 76, configured to use the multi-classification model to predict the second probability of the label corresponding to the second training sentence, based on the representation vector of the sentence obtained by the sentence vector generation unit 75; and
a model training unit 77, configured to train the second bidirectional transducer model, the second language model, and the multi-classification model by minimizing the sum of the first loss function and the second loss function, to obtain the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function is related to the second probability.
Optionally, in one embodiment, the backward vector generation unit 72 is specifically configured to:
use the second bidirectional transducer model to apply, for each word in the second training sentence, a self-attention mechanism that extracts multiple pieces of important information from different perspectives, based on the initial word vector of the word and the information following the word in the second training sentence; and
splice the vectors corresponding to the pieces of important information together to obtain the backward vector corresponding to the word.
Optionally, in one embodiment, the sentence vector generation unit 75 is specifically configured to take the mean of the target word vectors of the positions in the second training sentence, and to use the mean as the representation vector of the sentence corresponding to the second training sentence.
Optionally, in one embodiment, the model training unit 77 is specifically configured to minimize the sum of the first loss function and the second loss function by gradient descent, so as to determine the model parameters of the second bidirectional transducer model, the second language model, and the multi-classification model.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 71 first uses the second bidirectional transducer model trained by the method of Fig. 2 to obtain, for each word in the second training sentence, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 72 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 73 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the first prediction unit 74 then uses the second language model trained by the method of Fig. 2 to predict, from the target word vector of each position, the first probability of the corresponding word; the sentence vector generation unit 75 generates the representation vector of the sentence from the target word vectors of all positions; the second prediction unit 76 then uses the multi-classification model to predict, based on this representation vector, the second probability of the label corresponding to the second training sentence; and finally the model training unit 77 trains the second bidirectional transducer model, the second language model, and the multi-classification model by minimizing the sum of the first loss function and the second loss function, obtaining the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function to the second probability. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed; moreover, on the basis of the first-aspect training of the bidirectional transducer model and the language model, joint training of the bidirectional transducer model, the language model, and the multi-classification model is further performed, achieving a better model training result.
According to an embodiment of another aspect, a text classification apparatus is also provided. The apparatus is configured to execute the text classification method provided by the embodiments of this specification, for example, the text classification method shown in Fig. 4. Fig. 8 shows a schematic block diagram of the text classification apparatus according to one embodiment. As shown in Fig. 8, the apparatus 800 includes:
a forward vector generation unit 81, configured to use the third bidirectional transducer model trained by the method of Fig. 3 to obtain, for each word in a sentence to be classified, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the sentence to be classified;
a backward vector generation unit 82, configured to use the third bidirectional transducer model to obtain, for each word in the sentence to be classified, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the sentence to be classified;
a word vector generation unit 83, configured to splice together, according to the position of each word in the sentence to be classified, the forward vector of the word preceding the position obtained by the forward vector generation unit 81 and the backward vector of the word following the position obtained by the backward vector generation unit 82, as the target word vector corresponding to the position;
a sentence vector generation unit 84, configured to generate the representation vector of the sentence corresponding to the sentence to be classified according to the target word vectors of the positions obtained by the word vector generation unit 83; and
a text classification unit 85, configured to use the second multi-classification model trained by the method of Fig. 3 to perform text classification on the sentence to be classified, based on the representation vector of the sentence obtained by the sentence vector generation unit 84.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 81 first uses the third bidirectional transducer model trained by the method of Fig. 3 to obtain, for each word in the sentence to be classified, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 82 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 83 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the sentence vector generation unit 84 generates the representation vector of the sentence from the target word vectors of all positions; and finally the text classification unit 85 uses the second multi-classification model trained by the method of Fig. 3 to classify the sentence based on its representation vector. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed, and the bidirectional transducer model and the multi-classification model obtained through two-stage training help achieve better text classification results.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the methods described with reference to Figs. 2 to 4.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, the memory storing executable code; when the processor executes the executable code, the methods described with reference to Figs. 2 to 4 are implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solution of the present invention shall fall within the protection scope of the present invention.