Summary of the invention
One or more embodiments of this specification describe a model training method for text analysis, a text classification method, and corresponding apparatuses, which exploit the high running speed of the Transformer model when performing deep modeling of sequence data, while ensuring the robustness of the model.
In a first aspect, a model training method for text analysis is provided, the method comprising:
using a first bidirectional transformer model, obtaining, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence;
using the first bidirectional transformer model, obtaining, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence;
for each position in the first training sentence, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
using a first language model, predicting, for the target word vector of each position in the first training sentence, a first probability of the word at that position; and
training the first bidirectional transformer model and the first language model by minimizing a first loss function related to the first probability, to obtain a trained second bidirectional transformer model and a trained second language model.
In a possible implementation, using the first bidirectional transformer model to obtain, for each word in the first training sentence, the backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence comprises:
using the first bidirectional transformer model with a self-attention mechanism, extracting, for each word in the first training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the first training sentence; and
concatenating the vectors corresponding to the multiple pieces of salient information, to obtain the backward vector for the word.
In a second aspect, a model training method for text analysis is provided, the method comprising:
using the second bidirectional transformer model trained by the method of the first aspect, obtaining, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence;
using the second bidirectional transformer model, obtaining, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence;
for each position in the second training sentence, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
using the second language model trained by the method of the first aspect, predicting, for the target word vector of each position in the second training sentence, a first probability of the word at that position; and generating, from the target word vectors of the positions in the second training sentence, a representation vector of the sentence corresponding to the second training sentence;
using a multi-class classification model, predicting, based on the representation vector of the sentence corresponding to the second training sentence, a second probability of the label corresponding to the second training sentence; and
training the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of a first loss function and a second loss function, to obtain a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability.
In a possible implementation, using the second bidirectional transformer model to obtain, for each word in the second training sentence, the backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence comprises:
using the second bidirectional transformer model with a self-attention mechanism, extracting, for each word in the second training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the second training sentence; and
concatenating the vectors corresponding to the multiple pieces of salient information, to obtain the backward vector for the word.
In a possible implementation, generating, from the target word vectors of the positions in the second training sentence, the representation vector of the sentence corresponding to the second training sentence comprises:
averaging the target word vectors of the positions in the second training sentence, and using the mean as the representation vector of the sentence corresponding to the second training sentence.
In a possible implementation, training the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of the first loss function and the second loss function comprises:
minimizing the sum of the first loss function and the second loss function by gradient descent, to determine the model parameters of the second bidirectional transformer model, the second language model, and the multi-class classification model.
In a third aspect, a text classification method is provided, the method comprising:
using the third bidirectional transformer model trained by the method of the second aspect, obtaining, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified;
using the third bidirectional transformer model, obtaining, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified;
for each position in the sentence to be classified, concatenating the forward vector of the word preceding that position with the backward vector of the word following that position, to form the target word vector for that position;
generating, from the target word vectors of the positions in the sentence to be classified, a representation vector of the sentence corresponding to the sentence to be classified; and
using the second multi-class classification model trained by the method of the second aspect, classifying the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified.
In a fourth aspect, a model training apparatus for text analysis is provided, the apparatus comprising:
a forward vector generation unit configured to use a first bidirectional transformer model to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence;
a backward vector generation unit configured to use the first bidirectional transformer model to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence;
a word vector generation unit configured to concatenate, for each position in the first training sentence, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a prediction unit configured to use a first language model to predict, for the target word vector of each position in the first training sentence obtained by the word vector generation unit, a first probability of the word at that position; and
a model training unit configured to train the first bidirectional transformer model and the first language model by minimizing a first loss function related to the first probability obtained by the prediction unit, to obtain a trained second bidirectional transformer model and a trained second language model.
In a fifth aspect, a model training apparatus for text analysis is provided, the apparatus comprising:
a forward vector generation unit configured to use the second bidirectional transformer model trained by the method of the first aspect to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence;
a backward vector generation unit configured to use the second bidirectional transformer model to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence;
a word vector generation unit configured to concatenate, for each position in the second training sentence, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a first prediction unit configured to use the second language model trained by the method of the first aspect to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position;
a sentence vector generation unit configured to generate, from the target word vectors of the positions in the second training sentence obtained by the word vector generation unit, a representation vector of the sentence corresponding to the second training sentence;
a second prediction unit configured to use a multi-class classification model to predict, based on the representation vector of the sentence corresponding to the second training sentence obtained by the sentence vector generation unit, a second probability of the label corresponding to the second training sentence; and
a model training unit configured to train the second bidirectional transformer model, the second language model, and the multi-class classification model by minimizing the sum of a first loss function and a second loss function, to obtain a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability.
In a sixth aspect, a text classification apparatus is provided, the apparatus comprising:
a forward vector generation unit configured to use the third bidirectional transformer model trained by the method of the second aspect to obtain, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified;
a backward vector generation unit configured to use the third bidirectional transformer model to obtain, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified;
a word vector generation unit configured to concatenate, for each position in the sentence to be classified, the forward vector of the word preceding that position obtained by the forward vector generation unit with the backward vector of the word following that position obtained by the backward vector generation unit, to form the target word vector for that position;
a sentence vector generation unit configured to generate, from the target word vectors of the positions in the sentence to be classified obtained by the word vector generation unit, a representation vector of the sentence corresponding to the sentence to be classified; and
a text classification unit configured to use the second multi-class classification model trained by the method of the second aspect to classify the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified obtained by the sentence vector generation unit.
In a seventh aspect, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed in a computer, causes the computer to perform the method of the first, second, or third aspect.
In an eighth aspect, a computing device is provided, comprising a memory and a processor, the memory storing executable code, and the processor, when executing the executable code, implementing the method of the first, second, or third aspect.
With the methods and apparatuses provided by the embodiments of this specification, in one aspect, a first bidirectional transformer model is first used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence; the first bidirectional transformer model is then used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence; next, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a first language model is then used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position; finally, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. Unlike a common unidirectional Transformer model, the bidirectional transformer model in the embodiments of this specification fully considers the bidirectional context of each word rather than only the preceding context, so that when performing deep modeling of sequence data it can exploit the high running speed of the Transformer model while ensuring the robustness of the model.
In another aspect, the second bidirectional transformer model trained by the method of the first aspect is first used to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence; the second bidirectional transformer model is then used to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence; next, for each position in the second training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; the second language model trained by the method of the first aspect is then used to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position, and a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of the positions in the second training sentence; a multi-class classification model is then used to predict, based on the representation vector of the sentence corresponding to the second training sentence, a second probability of the label corresponding to the second training sentence; finally, the second bidirectional transformer model, the second language model, and the multi-class classification model are trained by minimizing the sum of a first loss function and a second loss function, yielding a third bidirectional transformer model, a third language model, and a second multi-class classification model, wherein the first loss function is related to the first probability and the second loss function is related to the second probability. In the embodiments of this specification, not only can the high running speed of the Transformer model be exploited, with the robustness of the model ensured, when performing deep modeling of sequence data; moreover, on the basis of the training of the bidirectional transformer model and the language model, the bidirectional transformer model, the language model, and the multi-class classification model are further jointly trained, achieving a better model training effect.
In yet another aspect, the third bidirectional transformer model trained by the method of the second aspect is first used to obtain, for each word in a sentence to be classified, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the sentence to be classified; the third bidirectional transformer model is then used to obtain, for each word in the sentence to be classified, a backward vector for the word based on the word's initial word vector and the following context of the word in the sentence to be classified; next, for each position in the sentence to be classified, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a representation vector of the sentence corresponding to the sentence to be classified is then generated from the target word vectors of the positions in the sentence to be classified; finally, the second multi-class classification model trained by the method of the second aspect is used to classify the sentence to be classified based on the representation vector of the sentence corresponding to the sentence to be classified. In the embodiments of this specification, the high running speed of the Transformer model can be exploited, and the robustness of the model ensured, when performing deep modeling of sequence data; furthermore, the bidirectional transformer model and the multi-class classification model have undergone two stages of training, which is conducive to obtaining better text classification results.
Specific embodiment
The solutions provided by this specification are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification. The implementation scenario involves text classification and the training of models for text analysis. Referring to Fig. 1, the scenario involves three kinds of models: a bidirectional transformer model (also called a two-way Transformer model), a language model, and a multi-class classification model (also called a multi-classifier). When the models are trained, multiple models can be jointly trained to form multi-task learning.
Text classification: text classification refers to the task of classifying a text input by a user into one or more of several predefined classes.
Language model: a language model judges whether a sentence is correct natural language by computing the probability of the sentence occurring in natural language, and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. A neural language model is a kind of language model that models the probability of each sentence with a neural network. By learning from a large corpus, a neural language model can learn the inherent rules and knowledge of a language.
Multi-task learning: multi-task learning is a field of machine learning research that aims to jointly learn multiple related tasks in the same model or framework, so that knowledge transfers between the tasks and the performance of each task improves.
As shown in Fig. 1, model training is divided into two stages: a pre-training stage and a fine-tuning stage.
In the pre-training stage, for a sentence S = {w1, w2, …, wN} consisting of N words, the two-way Transformer model first converts S into N vectors {v1, v2, …, vN}, where each vector is the output vector of one word and is produced with full consideration of the bidirectional context of that word. The output vector vi of each word is then used by the language model to predict the word wi at the current position, so that the two-way Transformer model and the language model are trained based on the prediction results.
In the fine-tuning stage, labeled text data is used. The two-way Transformer model likewise converts the sentence S = {w1, w2, …, wN} into vectors {v1, v2, …, vN}. The mean v̄ of the output vectors {v1, v2, …, vN} of all words is then taken as the representation vector of the sentence, and this representation vector is fed into the multi-classifier for text classification. At the same time, the output vector of each word is fed into the language model to predict the current word, so that in the fine-tuning stage the classification task and the language model prediction task form multi-task learning, which can improve the generalization ability of the multi-class classification model.
In the prediction stage, for a sentence input by a user, the representation vector of the sentence is obtained by averaging the output vectors of the two-way Transformer model, and the representation vector of the sentence is fed into the multi-classifier for classification.
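As a concrete reading of this prediction path, the sketch below averages the per-word output vectors into the sentence representation and feeds it to a linear softmax multi-classifier. The linear-plus-softmax form and the parameter names `W` and `b` are illustrative assumptions; the text does not fix the multi-classifier's internal form.

```python
import numpy as np

def classify_sentence(output_vectors, W, b):
    """output_vectors: (N, d) per-word outputs of the two-way Transformer;
    W: (d, C) and b: (C,) are assumed multi-classifier parameters.
    Returns the predicted class index and the class probabilities."""
    s = output_vectors.mean(axis=0)     # sentence representation: mean of word vectors
    logits = s @ W + b                  # assumed linear multi-classifier
    e = np.exp(logits - logits.max())   # numerically stable softmax
    probs = e / e.sum()
    return int(probs.argmax()), probs
```

The mean pooling makes the sentence vector independent of sentence length, which is why the same classifier can be applied to sentences of any N.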
In the embodiments of this specification, the two-way Transformer model and the language model are first pre-trained; the two-way Transformer model fully considers the bidirectional context of each word, rather than only the preceding context. The pre-trained Transformer model is then fine-tuned on the text classification task to improve the robustness of the model.
Fig. 2 shows a flowchart of a model training method for text analysis according to one embodiment, which may correspond to the pre-training stage mentioned in the application scenario shown in Fig. 1. As shown in Fig. 2, the model training method for text analysis in this embodiment comprises the following steps.
First, in step 21, a first bidirectional transformer model is used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence. It can be understood that the process of obtaining the forward vector of each word in step 21 is similar to that of a unidirectional Transformer model.
Then, in step 22, the first bidirectional transformer model is used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence. It can be understood that the following context of each word is used in the process of obtaining the backward vector of that word in step 22.
In one example, the first bidirectional transformer model is used with a self-attention mechanism to extract, for each word in the first training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the first training sentence; the vectors corresponding to the multiple pieces of salient information are then concatenated to obtain the backward vector of the word.
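A minimal sketch of this example follows, under the assumption that the "different perspectives" are attention heads and that the backward direction is realized by letting each position attend only to itself and the following positions. The single unscaled projection per head is a simplification for illustration, not the exact network of the embodiment.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def backward_vectors(x, heads):
    """x: (n, d) initial word vectors; heads: list of (Wq, Wk, Wv)
    projection triples, one per attention 'perspective'. Each head lets
    position i attend only to positions j >= i (the word and its
    following context); the per-head outputs are concatenated."""
    n = x.shape[0]
    allow = np.triu(np.ones((n, n), dtype=bool))   # position i may see j >= i only
    outs = []
    for Wq, Wk, Wv in heads:
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        scores = np.where(allow, scores, -np.inf)  # mask out preceding positions
        outs.append(softmax(scores) @ v)
    return np.concatenate(outs, axis=-1)           # splice the salient information
```

Note that the last position can attend only to itself, so its backward vector reduces to its own value projections, which matches the intuition that the last word has no following context.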
Then, in step 23, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position. It can be understood that the target word vector of each position embodies both the context before the position and the context after it, giving good robustness.
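The splicing of step 23 can be written down directly. The text does not specify what stands in at the two sentence boundaries (the first position has no preceding word, the last has no following word), so the zero padding below is an assumption.

```python
import numpy as np

def target_word_vectors(forward, backward):
    """forward, backward: (N, d) forward/backward vectors per word.
    The target vector of position i concatenates the forward vector of
    the previous word with the backward vector of the next word;
    boundary positions are zero-padded (an assumption)."""
    n, d = forward.shape
    pad = np.zeros(d)
    out = []
    for i in range(n):
        prev_fwd = forward[i - 1] if i > 0 else pad
        next_bwd = backward[i + 1] if i < n - 1 else pad
        out.append(np.concatenate([prev_fwd, next_bwd]))
    return np.stack(out)  # (N, 2d)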
Next, in step 24, a first language model is used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position. It can be understood that the bidirectional transformer model is used to obtain the target word vector of each position in the first training sentence, and the language model then predicts the probability of the word at that position from the target word vector.
Finally, in step 25, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. It can be understood that the training sentences used for this training require no manual labeling, which makes it convenient to train the models on a large unlabeled corpus.
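A common realization of a first loss function "related to the first probability" is the negative log-likelihood of the true word at each position. This concrete choice is an assumption for illustration; the text only requires that the loss depend on the first probability.

```python
import numpy as np

def first_loss(logits, word_ids):
    """logits: (N, V) language-model scores over a V-word vocabulary for
    each position; word_ids: (N,) the true word at each position.
    Returns the mean negative log of the first probability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)   # first probability per candidate word
    return float(-np.mean(np.log(probs[np.arange(len(word_ids)), word_ids])))
```

With uniform logits the loss equals log V, and it falls toward zero as the model concentrates probability on the correct words, which is exactly the direction minimization pushes the transformer and language model parameters.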
With the method provided by this embodiment of this specification, a first bidirectional transformer model is first used to obtain, for each word in a first training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the first training sentence; the first bidirectional transformer model is then used to obtain, for each word in the first training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the first training sentence; next, for each position in the first training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position; a first language model is then used to predict, for the target word vector of each position in the first training sentence, a first probability of the word at that position; finally, the first bidirectional transformer model and the first language model are trained by minimizing a first loss function related to the first probability, yielding a trained second bidirectional transformer model and a trained second language model. In this embodiment of this specification, unlike a common unidirectional Transformer model, the bidirectional transformer model fully considers the bidirectional context of each word rather than only the preceding context, so that when performing deep modeling of sequence data it can exploit the high running speed of the Transformer model while ensuring the robustness of the model.
Fig. 3 shows a flowchart of a model training method for text analysis according to another embodiment, which may correspond to the fine-tuning stage mentioned in the application scenario shown in Fig. 1. As shown in Fig. 3, the model training method in this embodiment comprises the following steps.
First, in step 31, the second bidirectional transformer model trained by the method described in Fig. 2 is used to obtain, for each word in a second training sentence, a forward vector for the word based on the word's initial word vector and the preceding context of the word in the second training sentence. It can be understood that the process of obtaining the forward vector of each word in step 31 is similar to that of a unidirectional Transformer model.
Then, in step 32, the second bidirectional transformer model is used to obtain, for each word in the second training sentence, a backward vector for the word based on the word's initial word vector and the following context of the word in the second training sentence. It can be understood that the following context of each word is used in the process of obtaining the backward vector of that word in step 32.
In one example, the second bidirectional transformer model is used with a self-attention mechanism to extract, for each word in the second training sentence, multiple pieces of salient information from different perspectives based on the word's initial word vector and the following context of the word in the second training sentence; the vectors corresponding to the multiple pieces of salient information are then concatenated to obtain the backward vector of the word.
Then, in step 33, for each position in the second training sentence, the forward vector of the word preceding that position is concatenated with the backward vector of the word following that position to form the target word vector for that position. It can be understood that the target word vector of each position embodies both the context before the position and the context after it, giving good robustness.
Next, in step 34, the second language model trained by the method described in Fig. 2 is used to predict, for the target word vector of each position in the second training sentence, a first probability of the word at that position. It can be understood that the bidirectional transformer model is used to obtain the target word vector of each position in the second training sentence, and the language model then predicts the probability of the word at that position from the target word vector.
Next, in step 35, a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of the positions in the second training sentence. It can be understood that generating the representation vector of the sentence combines the target word vectors of multiple positions, rather than just the target word vector of a single position.
In one example, the mean of the target word vectors of the positions in the second training sentence is taken, and this mean serves as the representation vector of the sentence corresponding to the second training sentence.
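The mean in this example is an element-wise average of the per-position target word vectors; a minimal sketch:

```python
def sentence_vector(vectors):
    # Element-wise mean of the target word vectors of all positions,
    # used as the representation vector of the whole sentence.
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(dim)]
```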
Then, in step 36, using the multi-classification model, a second probability of the label corresponding to the second training sentence is predicted based on the representation vector of the sentence. It can be understood that the label is a pre-annotated text classification category.
Finally, in step 37, by minimizing the sum of the first loss function and the second loss function, the second bidirectional transducer model, the second language model, and the multi-classification model are trained to obtain a third bidirectional transducer model, a third language model, and a second multi-classification model; the first loss function is related to the first probability, and the second loss function is related to the second probability. It can be understood that the training sentences used at this stage must be labeled manually, so that the annotated corpus can be used to further train the model.
In one example, the sum of the first loss function and the second loss function is minimized by gradient descent, so as to determine the model parameters of the second bidirectional transducer model, the second language model, and the multi-classification model.
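The idea of minimizing the sum of two loss functions by gradient descent can be illustrated on a toy one-parameter problem; the quadratic losses here are illustrative stand-ins, not the model's actual losses.

```python
def grad_sum(w):
    # Gradient of L = L1 + L2 with illustrative losses
    # L1 = (w - 1)^2 and L2 = (w - 3)^2:
    # d/dw [(w - 1)^2 + (w - 3)^2] = 2(w - 1) + 2(w - 3)
    return 2 * (w - 1) + 2 * (w - 3)

w = 0.0
for _ in range(200):
    w -= 0.1 * grad_sum(w)  # the step size plays the role of gamma
# w converges to 2.0, the minimizer of the SUM, trading off both losses
```

Minimizing the summed loss thus settles on a compromise between the two objectives, which is the point of the joint training step.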
In the method provided by this embodiment of the specification, the second bidirectional transducer model trained by the method of Fig. 2 is first used to obtain, for each word in the second training sentence, the forward vector of the word, based on the initial word vector of the word and the information preceding the word in the second training sentence; the same model is then used to obtain the backward vector of each word, based on the initial word vector of the word and the information following the word. Next, according to the position of each word in the second training sentence, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector of the position. The second language model trained by the method of Fig. 2 then predicts, for the target word vector of each position, the first probability of the corresponding word, and a representation vector of the sentence corresponding to the second training sentence is generated from the target word vectors of all positions. The multi-classification model then predicts, based on this representation vector, the second probability of the label corresponding to the second training sentence. Finally, by minimizing the sum of the first loss function and the second loss function, the second bidirectional transducer model, the second language model, and the multi-classification model are trained to obtain the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function to the second probability. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data, while the robustness of the model is guaranteed; moreover, on the basis of the training of the bidirectional transducer model and the language model, joint training of the bidirectional transducer model, the language model, and the multi-classification model is further performed, achieving a better model training result.
Fig. 4 shows a flowchart of a text classification method according to another embodiment; this method corresponds to the prediction stage of the application scenario shown in Fig. 1. As shown in Fig. 4, the text classification method of this embodiment includes the following steps.
First, in step 41, using the third bidirectional transducer model trained by the method of Fig. 3, for each word in a sentence to be classified, the forward vector corresponding to the word is obtained based on the initial word vector of the word and the information preceding the word in the sentence to be classified. It can be understood that in step 41 the process of obtaining the forward vector of each word is similar to that of a unidirectional Transformer model.
Then, in step 42, using the third bidirectional transducer model, for each word in the sentence to be classified, the backward vector corresponding to the word is obtained based on the initial word vector of the word and the information following the word in the sentence to be classified. It can be understood that in step 42 the information following each word is used in obtaining its backward vector.
Then, in step 43, according to the position of each word in the sentence to be classified, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector corresponding to the position. It can be understood that the target word vector for each position reflects both the text preceding the position and the text following it, which gives the model good robustness.
Then, in step 44, a representation vector of the sentence corresponding to the sentence to be classified is generated according to the target word vectors of the positions in the sentence to be classified. It can be understood that this representation vector combines the target word vectors of multiple positions, rather than the target word vector of a single position.
Finally, in step 45, using the second multi-classification model trained by the method of Fig. 3, text classification is performed on the sentence to be classified based on its representation vector. Specifically, the multi-classification model predicts, from the representation vector, the probability that the sentence to be classified belongs to each category, and the category with the highest probability is taken as the classification result.
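The final selection in step 45 is an argmax over the predicted category probabilities; a minimal sketch, with hypothetical category names:

```python
def classify(prob_by_category):
    # Return the category with the highest predicted probability.
    return max(prob_by_category, key=prob_by_category.get)
```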
In the method provided by this embodiment of the specification, the third bidirectional transducer model trained by the method of Fig. 3 is first used to obtain, for each word in the sentence to be classified, the forward vector of the word, based on the initial word vector of the word and the information preceding it in the sentence to be classified; the same model is then used to obtain the backward vector of each word, based on the initial word vector of the word and the information following it. Next, according to the position of each word, the forward vector of the word preceding the position and the backward vector of the word following the position are spliced together as the target word vector of the position, and a representation vector of the sentence is generated from the target word vectors of all positions. Finally, using the second multi-classification model trained by the method of Fig. 3, text classification is performed on the sentence to be classified based on its representation vector. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed, and the bidirectional transducer model and the multi-classification model obtained through two-stage training help achieve better text classification results.
The three models involved in the foregoing embodiments are described in detail below: the bidirectional Transformer model (bidirectional Transformer for short), the language model, and the multi-classification model (also referred to as the multi-classifier).
First, the bidirectional Transformer.
The working principle of the conventional Transformer (i.e., the unidirectional Transformer) is introduced first and then extended to the bidirectional Transformer.
1. Unidirectional Transformer
The Transformer was proposed by Ashish Vaswani et al. of Google for converting text sequences into vectors. The Transformer overcomes the drawback of an LSTM, which must process text word by word: each word obtains its preceding information from the text before it through an attention mechanism, so the output vectors of all words can be computed in parallel.
Fig. 5 is a schematic diagram of the internal structure of the unidirectional Transformer model provided by this embodiment of the specification. Referring to Fig. 5, the input of the Transformer module is a vector sequence X = {x_1, x_2, ..., x_N}, where x_i is the representation vector of the i-th position. X first passes through a multi-head self-attention module, which makes each word interact with every word preceding it, so that each word is augmented with the important information associated with its preceding text. The multi-head self-attention module is composed of multiple self-attention modules of identical structure, each of which is computed as follows.
First, a fully connected (feed-forward) layer converts each word x_i into two vectors k_i and t_i:
k_i = tanh(W_q x_i + b)
t_i = tanh(W_v x_i + b)
where W_q and W_v are trainable parameters of the model. k_i is used to compute the importance of the preceding words {x_1, ..., x_{i-1}} for x_i, and t_i stores the information of x_i to be supplied to the other words. A weighted sum of the t_j of the preceding words, with weights derived from k_i, yields the vector c_i, which represents the important information extracted from the preceding text that is useful for x_i. The multi-head self-attention mechanism uses multiple such attention modules to extract important information for each word x_i from its preceding text {x_1, ..., x_{i-1}} from different perspectives. Finally, the vectors extracted by all the attention modules for each word are spliced into d_i, which is the output vector of the multi-head self-attention module for that word.
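The k_i / t_i construction above can be sketched as follows. The excerpt does not reproduce the exact weighting formula, so a softmax over k_i · t_j dot products is assumed here; this is an illustrative sketch, not the claimed implementation.

```python
import math

def tanh_vec(v):
    return [math.tanh(x) for x in v]

def matvec(W, x, b):
    # Affine map: one feed-forward layer before the tanh nonlinearity.
    return [sum(w * xv for w, xv in zip(row, x)) + bi for row, bi in zip(W, b)]

def attention_head(xs, Wq, Wv, b):
    # One self-attention head over the PRECEDING words only, as in the
    # unidirectional Transformer described in the text.
    ks = [tanh_vec(matvec(Wq, x, b)) for x in xs]  # k_i: importance queries
    ts = [tanh_vec(matvec(Wv, x, b)) for x in xs]  # t_i: stored information
    cs = []
    for i in range(len(xs)):
        if i == 0:
            cs.append([0.0] * len(ts[0]))  # no preceding text for word 0
            continue
        # Assumed weighting: softmax over dot products of k_i with each t_j
        scores = [sum(k * t for k, t in zip(ks[i], ts[j])) for j in range(i)]
        m = max(scores)
        ws = [math.exp(s - m) for s in scores]
        z = sum(ws)
        # c_i: weighted sum of the stored information of the preceding words
        cs.append([sum(w * ts[j][d] for j, w in enumerate(ws)) / z
                   for d in range(len(ts[0]))])
    return cs
```

Because c_i depends only on x_1, ..., x_i, every position can be computed in parallel, which is the parallelism advantage over the LSTM noted above.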
The output vector d_i of each word then passes through a normalization layer (layer normalization), a fully connected (feed-forward) layer, and another normalization layer to yield the output vector l_i, which is the vector of x_i after conversion by the Transformer module. The computation is:
l_i = LayerNorm(LayerNorm(x_i + d_i) + W · LayerNorm(x_i + d_i))
where W is a trainable parameter and LayerNorm normalizes one layer of the neural network so that the flow of information between layers is more stable. LayerNorm is computed as
LayerNorm(x) = (x − μ) / σ
where μ is the mean of all neurons in the layer and σ is the standard deviation of all neurons in the layer.
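The LayerNorm operation, as described by the μ and σ above, can be sketched as follows; no learned gain or bias appears in this excerpt, and the small eps is added only for numerical safety.

```python
import math

def layer_norm(v, eps=1e-5):
    # Normalize a layer's activations to zero mean and unit variance,
    # using the mean mu and standard deviation sigma of the layer.
    mu = sum(v) / len(v)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in v) / len(v))
    return [(x - mu) / (sigma + eps) for x in v]
```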
Transformer modules can be stacked, with the output of one Transformer layer serving as the input of the next, to form a multi-layer Transformer network. The computation of the Transformer network can be expressed as
H^(m) = Transformer(H^(m−1)), with H^(0) = X
where H^(m) denotes the output of the m-th Transformer layer.
2. Bidirectional Transformer
The bidirectional Transformer is an extension of the unidirectional Transformer. The unidirectional Transformer considers only the preceding information in its attention mechanism and ignores the following information, yet the text following each word is also useful to it. The bidirectional Transformer therefore models the sentence in both directions, preceding and following, which increases the expressive power of the model. Its computation proceeds as in the unidirectional case, but in both directions:
d_i^→ and d_i^← respectively denote the important information extracted for x_i from the preceding and the following text. After d_i^→ and d_i^← are obtained, each passes through the normalization and fully connected layers, as in the unidirectional Transformer, to yield the two output vectors of the bidirectional Transformer for x_i. After the multi-layer bidirectional Transformer, a sentence S = {w_1, w_2, ..., w_N} is finally converted into two groups of vectors, {l_1^→, ..., l_N^→} and {l_1^←, ..., l_N^←}, and the computation of the bidirectional Transformer network can likewise be expressed as layer-by-layer application of the bidirectional Transformer module.
Next, the language model.
A sentence S = {w_1, w_2, ..., w_N} is converted by the multi-layer bidirectional Transformer into two groups of vectors {l_1^→, ..., l_N^→} and {l_1^←, ..., l_N^←}. The model can be pre-trained on the language-model task: since this task requires no annotation, massive amounts of data are readily available for thorough pre-training. The goal of the language-model task is to predict each word x_i from its context {x_1, ..., x_{i-1}, x_{i+1}, ..., x_N}, so that the model learns the inherent regularities of natural language; if a model can correctly predict every word from its context, it has learned these regularities well. The bidirectional Transformer language model is computed as follows.
First, the forward vector of the (i−1)-th word and the backward vector of the (i+1)-th word are spliced together:
v_i = [l_{i−1}^→ ; l_{i+1}^←]
Then v_i is used to predict the probability of the i-th word w_i:
p(w_i | context) = softmax(W^LM v_i)_{w_i}
where W^LM is a trainable parameter of the language model and W_j^LM denotes the j-th row of W^LM.
The loss function of the language model is the mean of the cross-entropy loss over all words in the sentence:
L_LM = −(1/N) Σ_{i=1..N} log p(w_i | context)
The goal of the language model is to minimize L_LM. Let W denote the set of all trainable parameters in the bidirectional Transformer and W^LM the set of all trainable parameters in the language model. W and W^LM are optimized iteratively by gradient descent:
W ← W − γ_1 ∂L_LM/∂W, W^LM ← W^LM − γ_1 ∂L_LM/∂W^LM
During pre-training, the model is optimized iteratively until L_LM falls below a preset threshold β (typically 0.1, 0.01, or the like), at which point the model is trained and has learned the inherent regularities of natural language. γ_1 is typically a real number on the order of 0.0001.
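The cross-entropy language-model loss can be sketched as follows; the matrix shape assumed here (W_lm as a |V| × d table of row vectors, one row per vocabulary word) is for illustration only.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def lm_loss(W_lm, vs, word_ids):
    # Mean cross-entropy of predicting each word w_i from its spliced
    # context vector v_i: L_LM = -(1/N) * sum_i log p(w_i | context).
    total = 0.0
    for v, wid in zip(vs, word_ids):
        logits = [sum(w * x for w, x in zip(row, v)) for row in W_lm]
        total -= math.log(softmax(logits)[wid])
    return total / len(vs)
```

With uniform logits over a 2-word vocabulary, the loss is log 2 per word, which is a useful sanity check when implementing the pre-training objective.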
Finally, the multi-classifier.
In the fine-tuning stage, the pre-trained bidirectional Transformer model is fine-tuned on data with true labels. Because the bidirectional Transformer has already grasped knowledge of natural language through pre-training, fine-tuning the pre-trained model together with the classifier achieves a better result than training a randomly initialized model directly on the labeled data.
Fine-tuning comprises two parts. One part is the language-model part, identical to the pre-training process; the other part classifies each input sentence (S = {w_1, w_2, ..., w_N}, l), where l is the label of the sentence. The classification process is as follows.
First, the bidirectional Transformer converts S = {w_1, w_2, ..., w_N} into vectors [v_1, ..., v_N], where each vector is the representation vector of the word at the corresponding position. The mean of the representation vectors of all the words, denoted v̄, is then taken as the representation vector of the entire sentence.
A softmax classifier is then applied to v̄ to compute the probability that S belongs to each label:
p_C(l_k | S) = softmax(W^c v̄)_k
where W^c is the set of trainable parameters of the multi-classifier and W_k^c denotes the k-th row of W^c. The loss function of the classifier is the cross-entropy of each sample with respect to its true label l, i.e.
L_C = −log p_C(l | S)
In the fine-tuning process, the goal of the model is to minimize the sum of the language-model loss L_LM and the classifier loss L_C, L = L_LM + L_C; all parameters of the model are optimized iteratively by gradient descent. The value of γ_2 is usually one order of magnitude smaller than γ_1, around 0.00001.
In the prediction stage, the bidirectional Transformer converts the sentence S into the vector v̄, the multi-classifier computes the probability that S belongs to each label l_k, and the label with the highest probability is output:
l = argmax_k p_C(l_k | S)
Thus, the bidirectional Transformer is used to improve the effectiveness of the text classification model.
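The prediction step, a softmax over the sentence's mean representation followed by an argmax, can be sketched as follows; the label names are hypothetical.

```python
import math

def classify_sentence(W_c, v_bar, labels):
    # p_C(l_k | S) = softmax(W_c . v_bar); return the most probable label.
    logits = [sum(w * x for w, x in zip(row, v_bar)) for row in W_c]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return labels[probs.index(max(probs))]
```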
It should be noted that the multi-task classifier and the task discriminator are not limited to softmax classifiers; any model capable of classification, such as a support vector machine, logistic regression, or a multi-layer neural network, can serve as the multi-task classifier or task discriminator.
According to an embodiment of another aspect, a model training apparatus for text analysis is also provided. The apparatus is configured to execute the model training method for text analysis provided by the embodiments of this specification, for example, the model training method for text analysis shown in Fig. 2. Fig. 6 shows a schematic block diagram of the model training apparatus for text analysis according to one embodiment. As shown in Fig. 6, the apparatus 600 includes:
a forward vector generation unit 61, configured to use the first bidirectional transducer model to obtain, for each word in the first training sentence, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the first training sentence;
a backward vector generation unit 62, configured to use the first bidirectional transducer model to obtain, for each word in the first training sentence, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the first training sentence;
a word vector generation unit 63, configured to splice together, according to the position of each word in the first training sentence, the forward vector of the word preceding the position obtained by the forward vector generation unit 61 and the backward vector of the word following the position obtained by the backward vector generation unit 62, as the target word vector corresponding to the position;
a prediction unit 64, configured to use the first language model to predict, for the target word vector of each position in the first training sentence obtained by the word vector generation unit 63, the first probability of the word corresponding to the position; and
a model training unit 65, configured to train the first bidirectional transducer model and the first language model by minimizing the first loss function related to the first probability obtained by the prediction unit 64, to obtain the trained second bidirectional transducer model and second language model.
Optionally, in one embodiment, the backward vector generation unit 62 is specifically configured to:
use the first bidirectional transducer model to apply, for each word in the first training sentence, a self-attention mechanism that extracts multiple pieces of important information from different perspectives, based on the initial word vector of the word and the information following the word in the first training sentence; and
splice the vectors corresponding to the pieces of important information together to obtain the backward vector corresponding to the word.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 61 first uses the first bidirectional transducer model to obtain, for each word in the first training sentence, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 62 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 63 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the prediction unit 64 then uses the first language model to predict, from the target word vector of each position, the first probability of the corresponding word; and finally the model training unit 65 trains the first bidirectional transducer model and the first language model by minimizing the first loss function related to the first probability, obtaining the trained second bidirectional transducer model and second language model. In this embodiment of the specification, unlike a common unidirectional Transformer model, the bidirectional transducer model fully considers the context of each word rather than only the preceding information, so that when performing deep modeling of sequence data, the high running speed of the Transformer model is exploited while the robustness of the model is guaranteed.
According to an embodiment of another aspect, a model training apparatus for text analysis is also provided. The apparatus is configured to execute the model training method for text analysis provided by the embodiments of this specification, for example, the model training method for text analysis shown in Fig. 3. Fig. 7 shows a schematic block diagram of the model training apparatus for text analysis according to one embodiment. As shown in Fig. 7, the apparatus 700 includes:
a forward vector generation unit 71, configured to use the second bidirectional transducer model trained by the method of Fig. 2 to obtain, for each word in the second training sentence, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the second training sentence;
a backward vector generation unit 72, configured to use the second bidirectional transducer model to obtain, for each word in the second training sentence, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the second training sentence;
a word vector generation unit 73, configured to splice together, according to the position of each word in the second training sentence, the forward vector of the word preceding the position obtained by the forward vector generation unit 71 and the backward vector of the word following the position obtained by the backward vector generation unit 72, as the target word vector corresponding to the position;
a first prediction unit 74, configured to use the second language model trained by the method of Fig. 2 to predict, for the target word vector of each position in the second training sentence, the first probability of the word corresponding to the position;
a sentence vector generation unit 75, configured to generate the representation vector of the sentence corresponding to the second training sentence according to the target word vectors of the positions in the second training sentence obtained by the word vector generation unit 73;
a second prediction unit 76, configured to use the multi-classification model to predict the second probability of the label corresponding to the second training sentence, based on the representation vector of the sentence obtained by the sentence vector generation unit 75; and
a model training unit 77, configured to train the second bidirectional transducer model, the second language model, and the multi-classification model by minimizing the sum of the first loss function and the second loss function, to obtain the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function is related to the second probability.
Optionally, in one embodiment, the backward vector generation unit 72 is specifically configured to:
use the second bidirectional transducer model to apply, for each word in the second training sentence, a self-attention mechanism that extracts multiple pieces of important information from different perspectives, based on the initial word vector of the word and the information following the word in the second training sentence; and
splice the vectors corresponding to the pieces of important information together to obtain the backward vector corresponding to the word.
Optionally, in one embodiment, the sentence vector generation unit 75 is specifically configured to take the mean of the target word vectors of the positions in the second training sentence, and to use the mean as the representation vector of the sentence corresponding to the second training sentence.
Optionally, in one embodiment, the model training unit 77 is specifically configured to minimize the sum of the first loss function and the second loss function by gradient descent, so as to determine the model parameters of the second bidirectional transducer model, the second language model, and the multi-classification model.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 71 first uses the second bidirectional transducer model trained by the method of Fig. 2 to obtain, for each word in the second training sentence, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 72 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 73 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the first prediction unit 74 then uses the second language model trained by the method of Fig. 2 to predict, from the target word vector of each position, the first probability of the corresponding word; the sentence vector generation unit 75 generates the representation vector of the sentence from the target word vectors of all positions; the second prediction unit 76 then uses the multi-classification model to predict, based on this representation vector, the second probability of the label corresponding to the second training sentence; and finally the model training unit 77 trains the second bidirectional transducer model, the second language model, and the multi-classification model by minimizing the sum of the first loss function and the second loss function, obtaining the third bidirectional transducer model, the third language model, and the second multi-classification model, where the first loss function is related to the first probability and the second loss function to the second probability. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed; moreover, on the basis of the first-aspect training of the bidirectional transducer model and the language model, joint training of the bidirectional transducer model, the language model, and the multi-classification model is further performed, achieving a better model training result.
According to an embodiment of another aspect, a text classification apparatus is also provided. The apparatus is configured to execute the text classification method provided by the embodiments of this specification, for example, the text classification method shown in Fig. 4. Fig. 8 shows a schematic block diagram of the text classification apparatus according to one embodiment. As shown in Fig. 8, the apparatus 800 includes:
a forward vector generation unit 81, configured to use the third bidirectional transducer model trained by the method of Fig. 3 to obtain, for each word in a sentence to be classified, the forward vector corresponding to the word, based on the initial word vector of the word and the information preceding the word in the sentence to be classified;
a backward vector generation unit 82, configured to use the third bidirectional transducer model to obtain, for each word in the sentence to be classified, the backward vector corresponding to the word, based on the initial word vector of the word and the information following the word in the sentence to be classified;
a word vector generation unit 83, configured to splice together, according to the position of each word in the sentence to be classified, the forward vector of the word preceding the position obtained by the forward vector generation unit 81 and the backward vector of the word following the position obtained by the backward vector generation unit 82, as the target word vector corresponding to the position;
a sentence vector generation unit 84, configured to generate the representation vector of the sentence corresponding to the sentence to be classified according to the target word vectors of the positions obtained by the word vector generation unit 83; and
a text classification unit 85, configured to use the second multi-classification model trained by the method of Fig. 3 to perform text classification on the sentence to be classified, based on the representation vector of the sentence obtained by the sentence vector generation unit 84.
With the apparatus provided by this embodiment of the specification, the forward vector generation unit 81 first uses the third bidirectional transducer model trained by the method of Fig. 3 to obtain, for each word in the sentence to be classified, the forward vector of the word, based on its initial word vector and the information preceding it; the backward vector generation unit 82 then uses the same model to obtain the backward vector of each word, based on its initial word vector and the information following it; the word vector generation unit 83 then splices, for each position, the forward vector of the preceding word and the backward vector of the following word into the target word vector of the position; the sentence vector generation unit 84 generates the representation vector of the sentence from the target word vectors of all positions; and finally the text classification unit 85 uses the second multi-classification model trained by the method of Fig. 3 to classify the sentence based on its representation vector. In this embodiment of the specification, the high running speed of the Transformer model is exploited when performing deep modeling of sequence data while the robustness of the model is guaranteed, and the bidirectional transducer model and the multi-classification model obtained through two-stage training help achieve better text classification results.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to execute the methods described with reference to Figs. 2 to 4.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, the memory storing executable code; when the processor executes the executable code, the methods described with reference to Figs. 2 to 4 are implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further describe in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent replacement, improvement, or the like made on the basis of the technical solution of the present invention shall fall within the protection scope of the present invention.