Summary of the invention
One or more embodiments of this specification describe a computer-executed natural language processing method and apparatus. They exploit the regularities present in the modeling of natural language processing models to complete that modeling automatically, and accumulate good experience through a model evaluation mechanism, thereby improving the effectiveness of natural language processing.
According to a first aspect, a computer-executed natural language processing method is provided. The method comprises: obtaining a plurality of manually labeled texts as initial samples; determining the text matrix corresponding to each text in the initial samples; expanding each text in the initial samples by at least one expansion scheme from a pre-stored expansion scheme set, and generating extended samples from the texts obtained by expansion, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to the same expansion scheme; training a natural language processing model using the initial samples and the extended samples together as training samples; and evaluating the trained natural language processing model and, if the evaluation result satisfies a predetermined condition, outputting the current natural language processing model for use in natural language processing.
In one embodiment, the plurality of manually labeled texts are texts that have been preprocessed by word segmentation and stop-word removal.
In one embodiment, the consistency condition includes at least one of the following:
the deviations in length and in width between the corresponding text matrices are both within a predetermined range;
the matrix similarity is greater than a predetermined matrix similarity threshold;
the texts are clustered into the same category.
In one embodiment, the expansion scheme includes word vector expansion, and the word vectors include at least one of the following:
word vectors based on word embedding, word vectors based on strokes, and word vectors based on affixes.
In one embodiment, the plurality of texts in the initial samples include a first text, and the first text includes a first word; and
expanding each text in the initial samples by at least one expansion scheme from the expansion scheme set comprises:
for the first word, detecting whether the corpus contains a similar word whose word-vector similarity to the first word is greater than a predetermined word similarity threshold;
if such a similar word exists, replacing the first word with the similar word, thereby expanding the first text.
In one embodiment, the expansion scheme includes multilingual translation expansion, the plurality of texts in the initial samples include a first text, and the first text is written in a first language;
expanding each text in the initial samples by at least one expansion scheme from the expansion scheme set comprises:
translating the first text, by a language conversion model, into a second text written in a second language;
translating the second text, by a language conversion model, into a third text written in the first language;
determining the text obtained by expansion according to the third text.
In one embodiment, generating extended samples from the texts obtained by expansion comprises:
labeling the texts obtained by expansion with a pre-trained labeling model, and forming extended samples from the labeling results together with the corresponding expanded texts.
In one embodiment, the predetermined condition includes at least one of the following:
the area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
the balanced F score of the trained natural language processing model is greater than a second threshold, wherein the balanced F score is the weighted harmonic mean of precision and recall.
According to a second aspect, a natural language processing apparatus is provided, comprising:
an obtaining unit, configured to obtain a plurality of manually labeled texts as initial samples;
a determination unit, configured to determine the text matrix corresponding to each text in the initial samples;
an expansion unit, configured to expand each text in the initial samples by at least one expansion scheme from a pre-stored expansion scheme set, and to generate extended samples from the texts obtained by expansion, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to the same expansion scheme;
a training unit, configured to train a natural language processing model using the initial samples and the extended samples together as training samples;
an evaluation unit, configured to evaluate the trained natural language processing model and, if the evaluation result satisfies a predetermined condition, to output the current natural language processing model for use in natural language processing.
According to a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method of the first aspect.
According to a fourth aspect, a computing device is provided, comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
In the computer-executed natural language processing method and apparatus provided by the embodiments of this specification, the initial samples are expanded according to their text matrices to obtain extended samples, the initial samples and the extended samples are used together as training samples to train a natural language processing model, and whether to output the current natural language processing model is then decided from its evaluation. The initial samples are expanded by any of several optional schemes in the expansion scheme set, the schemes can be combined with one another, and texts whose text matrix sizes satisfy the consistency condition correspond to the same expansion scheme. This improves the efficiency of expansion, improves its comprehensiveness, and allows the expansion result to be evaluated. In this way, the effectiveness of natural language processing can be improved.
Detailed description
The solutions provided in this specification are described below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation architecture of an embodiment of this specification. In this architecture, the natural language processing method is executed by the computing platform in Fig. 1. The computing platform may be deployed in various devices with computing capability (such as computers) or in device clusters. The computing platform first obtains a plurality of manually labeled texts as initial samples. It can be understood that each initial sample corresponds to one text and one manually assigned sample label. The computing platform can further convert each text into a text matrix, and then expand each text in the initial samples. The computing platform may pre-store expansion schemes of multiple expansion types. When expanding text, the computing platform can expand texts whose text matrices satisfy a consistency condition (for example, a matrix size deviation within a predetermined range) using expansion schemes of the same expansion type. The computing platform can then generate extended samples from the texts obtained by expansion. In one embodiment, an expanded text keeps the same sample label as the text before expansion, and the two together form an extended sample. In another embodiment, a pre-trained labeling model labels the expanded text to determine its sample label, thereby generating the extended sample.
Further, the computing platform can train an NLP model using the initial samples and the extended samples together as training samples, evaluate the currently trained NLP model, and decide from the evaluation result whether to output it. In one embodiment, the evaluation result is qualified, and the current model is output. In another embodiment, the evaluation result is unqualified, and the computing platform can reselect an expansion scheme for the initial samples and generate extended samples again, until the trained NLP model passes evaluation. In this way, the training of the NLP model can be completed automatically, improving the effectiveness of NLP model training. The process of natural language processing by computer is described in detail below.
Fig. 2 shows a flow chart of a computer-executed natural language processing method according to one embodiment. The method may be executed by any system, unit, platform, or server with computing and processing capability, such as the computing platform shown in Fig. 1.
As shown in Fig. 2, the method includes the following steps:
Step 21: obtain a plurality of manually labeled texts as initial samples.
Step 22: determine the text matrix corresponding to each text in the initial samples.
Step 23: expand each text in the initial samples by an expansion scheme of at least one expansion type from a pre-stored expansion scheme set, and generate extended samples from the texts obtained by expansion, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to expansion schemes of the same expansion type.
Step 24: train a natural language processing model using the initial samples and the extended samples together as training samples.
Step 25: evaluate the trained natural language processing model and, if the evaluation result satisfies a predetermined condition, output the current natural language processing model for use in natural language processing.
First, in step 21, a plurality of manually labeled texts are obtained as initial samples. The text here may be an article, a sentence, and so on. Each initial sample includes one text and one manually assigned label. The content of the label can be determined by the final usage scenario and purpose of the NLP model. For example, where the NLP model is used to predict text risk, the label may be "no risk" or "risky". One such initial sample includes the text "I had breakfast downstairs" and the manually assigned label "no risk".
In some embodiments, the texts in the initial samples may also be preprocessed texts. The text can be segmented into words and stripped of stop words in advance. Word segmentation divides the characters of a text into individual words. For example, the text "I had breakfast downstairs" can be divided, using a pre-trained dictionary, into candidate words such as "I", "downstairs", "eat", "breakfast", and "meal", together with function words. The stop words (the function words) are then removed, yielding words such as "downstairs, eat, breakfast, meal ...".
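The preprocessing described above can be sketched as follows. This is a minimal illustration only: whitespace splitting stands in for the dictionary-based segmenter, and the stop-word list `STOP_WORDS` is a hypothetical placeholder, not one given in this specification.

```python
# Hypothetical stop-word list; a real system would load a curated one.
STOP_WORDS = {"i", "a", "the", "had", "at"}


def preprocess(text, stop_words=STOP_WORDS):
    """Split text into lowercase tokens and drop stop words."""
    # Whitespace splitting is an assumption; Chinese text would need
    # a dictionary-based or model-based word segmenter instead.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in stop_words]


print(preprocess("I had a breakfast downstairs."))
```

Only the content words survive, which is what reduces the amount of data handled in later steps.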
It can be understood that after word segmentation and stop-word removal only the effective words of the text remain, so using only those words in subsequent processing can greatly reduce the amount of data to be processed. Therefore, in some embodiments, only the segmented, stop-word-free form of each text in the initial samples is used as the object of subsequent processing. For convenience, this specification still refers to the words obtained by preprocessing a text as the text in the initial sample. An initial sample may then include the text "get up, eat, go shopping, eat, rest, eat ..." and the sample label "no risk", and so on.
Then, in step 22, the text matrix corresponding to each text in the initial samples is determined. It can be understood that words and sentences can be represented by vectors. A text contains at least one sentence, so if each sentence is represented as a vector, the whole text can be represented as a matrix composed of multiple vectors.
As an example, each column of the text matrix can correspond to the weight of a word in each sentence, where the weight is positively correlated with the number of times the word occurs in the corresponding sentence. Take the following corpus as an example:
Xiao Ming likes maple leaves, and Xiao Hong also likes maple leaves.
Xiao Ming likes playing football.
Suppose the words occurring in the corpus are ordered as {Xiao Ming: 1, likes: 2, maple leaves: 3, Xiao Hong: 4, play: 5, football: 6}. Counting the occurrences of each word, the vector corresponding to "Xiao Ming likes maple leaves, and Xiao Hong also likes maple leaves" is {1, 2, 2, 1, 0, 0}, where the number in each dimension indicates that "Xiao Ming" occurs once, "likes" occurs twice, and so on. By the same counting, the second sentence corresponds to {1, 1, 0, 0, 1, 1}, so the matrix corresponding to the above corpus can be:
[1, 2, 2, 1, 0, 0]
[1, 1, 0, 0, 1, 1]
This is a 6*2 matrix, that is, a matrix of length 6 and width 2.
In this way, each text in the initial samples can be converted into a similar text matrix.
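As a rough sketch of how such a count-based text matrix could be computed, the following assumes each sentence has already been segmented into a list of words and orders the vocabulary by first appearance. The function names are illustrative and do not come from this specification.

```python
def build_vocabulary(sentences):
    """Assign each word an index in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def text_matrix(sentences):
    """Return one word-count vector per sentence over a shared vocabulary."""
    vocab = build_vocabulary(sentences)
    matrix = []
    for sentence in sentences:
        row = [0] * len(vocab)
        for word in sentence:
            row[vocab[word]] += 1
        matrix.append(row)
    return matrix


corpus = [
    ["Xiao Ming", "likes", "maple leaves", "Xiao Hong", "likes", "maple leaves"],
    ["Xiao Ming", "likes", "play", "football"],
]
for row in text_matrix(corpus):
    print(row)
```

The first row reproduces the vector {1, 2, 2, 1, 0, 0} given in the example above.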
Then, in step 23, each text in the initial samples is expanded by at least one expansion scheme from the pre-stored expansion scheme set, and extended samples are generated from the texts obtained by expansion. It is worth noting that, in this step, texts in the initial samples whose text matrices satisfy the consistency condition are expanded using the same expansion scheme. The expansion scheme set can be configured in advance and may include one or more expansion schemes.
According to one embodiment, the expansion schemes in the expansion scheme set may include word vector expansion. For each word in a text, similar words are determined by similarity, and the text is expanded using the similar words to obtain the expanded text.
In one embodiment, the word vectors may include word vectors based on word embedding. Word embedding maps the words or phrases in a text to vectors of real numbers. It embeds words into a common vector space according to their context: a word co-occurrence matrix is constructed, weights are calculated, and the word vector of each word is determined from it.
As an example, suppose there is the following corpus:
Xiao Ming likes maple leaves, and Xiao Hong likes maple leaves.
Xiao Ming likes playing football.
First, the words in the above corpus are extracted and ordered. Many ordering rules are possible; for example, words can be ordered by first appearance. When selecting words, one option is to take the words that remain after segmenting the corpus text and removing stop words. Another is to compute the term frequency-inverse document frequency (TF-IDF) of the words in the corpus and keep a predetermined number (for example, 20,000) of the words with the highest TF-IDF values.
Suppose the resulting word order is {Xiao Ming: 1, likes: 2, maple leaves: 3, Xiao Hong: 4, play: 5, football: 6}. These 6 words then correspond to 6 dimensions of the above corpus.
A simple vector representation of the words may be:
Xiao Ming: [1,0,0,0,0,0]
likes: [0,1,0,0,0,0]
……
A text can then be represented as a vector as follows:
"Xiao Ming likes maple leaves. Xiao Hong likes maple leaves.": [1,2,2,1,0,0] ...
Each dimension is the number of times the corresponding word occurs in the text. However, this representation usually does not take the associations between words into account.
To capture the associations between words, a corpus document is usually represented by a co-occurrence matrix. As shown in Table 1 below, the entry at the intersection of a row and a column is the number of times the two words occur together in the above corpus. For example, "Xiao Ming" and "likes" occur together 2 times.
Table 1: Number of times each pair of words occurs together in the above corpus
From Table 1 it can be determined that the co-occurrence matrix of the above corpus includes at least:
Each row (or column) of the co-occurrence matrix corresponds to the word vector of the corresponding word; for example, the word vector corresponding to "likes" is [2,0,2,1,1,0].
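The co-occurrence counting can be sketched as below. The specification does not state the co-occurrence window, so this sketch assumes that two words co-occur when they are adjacent within a sentence. Under that assumption the row for "likes" matches the vector [2,0,2,1,1,0] quoted above, but other window definitions are possible.

```python
def cooccurrence_matrix(sentences, vocab):
    """Count adjacent-word co-occurrences (assumed window of one word)."""
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    matrix = [[0] * size for _ in range(size)]
    for sentence in sentences:
        # Pair each word with its right-hand neighbour in the sentence.
        for left, right in zip(sentence, sentence[1:]):
            i, j = index[left], index[right]
            matrix[i][j] += 1
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


vocab = ["Xiao Ming", "likes", "maple leaves", "Xiao Hong", "play", "football"]
corpus = [
    ["Xiao Ming", "likes", "maple leaves", "Xiao Hong", "likes", "maple leaves"],
    ["Xiao Ming", "likes", "play", "football"],
]
matrix = cooccurrence_matrix(corpus, vocab)
print(matrix[vocab.index("likes")])
```

Each row of `matrix` can then serve as the word vector of the corresponding word.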
In another embodiment, the word vectors may include word vectors based on strokes. A word or phrase can be decomposed by stroke and mapped to a vector of real numbers; that is, the word vector of each word can be determined from how the word or phrase is written. Specifically, a word whose form is determined by strokes (such as a Chinese word) is split into stroke n-grams built from strokes such as "horizontal", "vertical", "left-falling", and "right-falling". The stroke-based word vector is then determined from the split result and its association with the context. Stroke-based word vectors can be determined by algorithms such as cw2vec, which are not described in detail here. Optionally, radicals (combinations of several strokes, such as "Xin") can be used in place of strokes to determine the word vectors.
In yet another embodiment, the word vectors may include word vectors based on affixes. Affix-based word vectors are similar to word vectors based on strokes (or radicals), except that affixes (such as "ing") serve as the basis for splitting; the details are not repeated here.
In further embodiments, the word vectors may also be determined by other methods, for example by supervised machine learning methods such as word2vec, CNN (convolutional neural network), and LSTM (long short-term memory), or by unsupervised machine learning methods such as k-means; no limitation is imposed here. In an optional embodiment, the expansion scheme may include at least two of the above kinds of word vectors, and word vectors determined by different methods may be combined into a new vector, for example by concatenating a word vector determined by word2vec with a word vector based on word embedding. Concatenating vectors of different forms in this way can increase the feature dimensionality of the NLP model.
Further, each text in the initial samples can also be expanded at the word level.
On the one hand, for a given word of a given text in the initial samples, word similarity can be determined by computing vector similarity. For convenience of description, the text is called the first text and the word the first word. The similarity of word vectors can be computed by methods such as cosine similarity or the Jaccard coefficient. Taking the Jaccard coefficient as an example, suppose the vector of word A is [1,0,0,1] and the vector of word B is [0,0,1,1]. Both are four-dimensional vectors and they agree in the second and fourth dimensions, so the similarity of word A and word B can be: identical dimensions / total dimensions = 2/(4+4). Words whose similarity to the original word (such as the first word) is greater than a predetermined word similarity threshold can then be taken as its similar words, or a predetermined number of words with the highest similarity to the original word can be taken; no limitation is imposed here. Expanded texts are formed by replacing words of the texts in the initial samples with their similar words.
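The replacement step can be sketched as follows, using cosine similarity, one of the measures named above. The word vectors and the 0.8 threshold are made-up illustrative values, and the function names are not taken from this specification.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_words(word, vectors, threshold):
    """Words whose vector similarity to `word` exceeds the threshold."""
    target = vectors[word]
    return [w for w, vec in vectors.items()
            if w != word and cosine_similarity(target, vec) > threshold]


def expand_by_replacement(tokens, vectors, threshold=0.8):
    """Create one new token list per (position, similar word) pair."""
    expanded = []
    for i, word in enumerate(tokens):
        if word not in vectors:
            continue  # no vector available, so the word is left as it is
        for candidate in similar_words(word, vectors, threshold):
            expanded.append(tokens[:i] + [candidate] + tokens[i + 1:])
    return expanded


# Toy vectors, assumed for illustration only.
vectors = {
    "dataset": [1.0, 0.0, 1.0],
    "corpus": [0.9, 0.1, 1.0],
    "expansion": [0.0, 1.0, 0.0],
}
print(expand_by_replacement(["dataset", "expansion"], vectors))
```

Taking the top few most similar words instead of applying a threshold would only change `similar_words`.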
It is worth noting that, in the embodiments of this specification, as many types of corpus as possible can be collected to enlarge the base corpus and so ensure the effectiveness and diversity of text expansion. For example, colloquial corpora can be collected from social tools such as microblogs, and professional corpora can be collected from sources such as Wikipedia.
On the other hand, the expansion scheme can also include multilingual translation expansion. Call any text in the initial samples the first text and suppose it is written in a first language (such as Chinese). The first text can be translated by a language conversion model into a second text in a second language, the second text can be translated back into a third text in the first language, and the text obtained by expansion is determined from the third text. For example, the Chinese first text "data set enlargement" can first be translated into the English second text "Data set expansion", which is then translated into the Chinese third text "data set extension". The third text "data set extension" can then be taken as the text obtained by expansion. It is worth noting that the second language can be any language other than the first language. Moreover, when one language is translated into another, the translation is not necessarily unique. In the above example, the first text can also be translated into at least one text in German, French, and so on; any of these can serve as the second language, and the second text is any one of the texts obtained by translation into it. Likewise, when the second text is translated back into the first language, the resulting text (the third text) need not be unique. Through such varied translation, a text can be expanded into a variety of semantically similar expressions.
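The round trip through a second language can be sketched as below. The `translate` callable is a hypothetical stand-in for the language conversion model, and the toy lookup table with English glosses exists only so that the example runs; no real translation API is implied.

```python
def back_translate(text, translate, source_lang, pivot_langs):
    """Translate to each pivot language and back, keeping new variants."""
    expanded = set()
    for pivot in pivot_langs:
        second = translate(text, source_lang, pivot)   # second text
        third = translate(second, pivot, source_lang)  # third text
        if third != text:
            expanded.add(third)
    return sorted(expanded)


# Toy lookup table standing in for a real translation model (assumption).
_TABLE = {
    ("data set enlargement", "zh", "en"): "Data set expansion",
    ("Data set expansion", "en", "zh"): "data set extension",
}


def toy_translate(text, src, dst):
    """Return the table entry, or the text unchanged if none exists."""
    return _TABLE.get((text, src, dst), text)


print(back_translate("data set enlargement", toy_translate, "zh", ["en"]))
```

Passing several pivot languages yields several candidate third texts, which matches the observation that translations are not unique.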
It can be understood that each text in the initial samples corresponds to one text matrix. If the text matrices of two texts are consistent, the matrices may be similar in size, or the two texts may be semantically close. Such texts can be expanded using the same expansion scheme. A consistency condition for text matrices can therefore be preset to check the consistency between the text matrices of different texts.
In one embodiment, the consistency condition can be that the deviations between the text matrices are all within a predetermined range; that is, the lengths and widths are identical or both close. If the lengths and the widths of the text matrices of two texts are both identical, the two texts have at least the same number of words (or keywords) and the same number of sentences. Two matrices are close in length and width when the deviations in length and in width are both within a predetermined range (for example, less than 5%). For example, if one text matrix is 100*100 and another is 98*98, the length deviation and the width deviation of the two matrices can both be taken as (100-98)/100, or as (100-98)/98. If the predetermined range is less than 5%, the two matrices can be considered consistent.
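A minimal sketch of this size check follows. It assumes the deviation is taken relative to the larger of the two dimensions, which is one of the two choices mentioned above, and the 5% default mirrors the example.

```python
def relative_deviation(a, b):
    """Deviation of two sizes relative to the larger one (assumed choice)."""
    if a == b:
        return 0.0
    return abs(a - b) / max(a, b)


def sizes_consistent(shape_a, shape_b, max_deviation=0.05):
    """True if both the length and the width deviate by less than the limit."""
    return all(relative_deviation(a, b) < max_deviation
               for a, b in zip(shape_a, shape_b))


print(sizes_consistent((100, 100), (98, 98)))   # deviation 0.02 in each dimension
print(sizes_consistent((100, 100), (80, 100)))  # length deviation 0.20
```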
In another embodiment, the consistency condition can be that the matrix similarity is greater than a predetermined matrix similarity threshold. The matrix similarity can be determined by any matrix similarity method. For example, the two matrices can be flattened into two vectors, and the similarity of the two vectors, determined by a method such as the Jaccard coefficient described above, is taken as the similarity of the two matrices. As another example, the elements at corresponding positions of the two matrices can be subtracted, and the ratio of the number of elements of the resulting matrix that are smaller than a predetermined value (such as 1) to the total number of elements of the matrix is taken as the similarity of the two matrices. If the similarity of two text matrices exceeds the predetermined matrix similarity threshold (such as 90%), the two matrices are similar. It can be understood that each element of a text matrix may represent the meaning expressed by the text, the words used, and so on. If the text matrices are determined on the basis of the same corpus, elements at the same position are likely to have the same meaning, so the closer the element values, the more similar the semantics of the two texts. Semantically similar texts can be expanded using the same expansion scheme, which makes it easy to reuse known expansion experience.
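The element-wise comparison can be sketched as follows. The sketch assumes the two matrices have the same shape and compares the absolute value of each difference with the tolerance; the specification only says the difference is compared with a predetermined value, so the absolute value is an assumption.

```python
def matrix_similarity(m1, m2, tolerance=1):
    """Fraction of positions whose absolute difference is below tolerance."""
    same_shape = len(m1) == len(m2) and all(
        len(r1) == len(r2) for r1, r2 in zip(m1, m2))
    if not same_shape:
        raise ValueError("matrices must have the same shape")
    total = sum(len(row) for row in m1)
    close = sum(1 for r1, r2 in zip(m1, m2)
                for a, b in zip(r1, r2) if abs(a - b) < tolerance)
    return close / total if total else 0.0


a = [[1, 2, 2, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
b = [[1, 2, 2, 1, 0, 0], [1, 1, 0, 0, 1, 0]]
similarity = matrix_similarity(a, b)
print(similarity, similarity > 0.9)
```

Here 11 of the 12 positions agree, so the two matrices pass a 90% threshold.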
In yet another embodiment, the consistency condition can be that the texts are clustered into the same category. Specifically, the texts in the initial samples can be clustered according to the values of the elements of their text matrices. For example, a text matrix can be regarded as a multidimensional space, each text matrix can be mapped to a point in that space according to its element values, and the texts in the initial samples can then be clustered by a clustering method based on, for example, Euclidean distance. Texts clustered into the same category are consistent texts. Optionally, during clustering, the texts currently to be expanded can be clustered together with historical texts whose expansion worked well, so that the expansion experience with better historical results is reused for text expansion, which is more efficient.
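As one possible illustration of distance-based grouping, the sketch below flattens each text matrix into a point and applies a simple greedy rule: a text joins the first cluster whose leader lies within a given Euclidean radius. This greedy rule and the `radius` parameter are stand-ins chosen for brevity; the specification does not prescribe a particular clustering algorithm, and the sketch assumes all matrices share the same shape.

```python
import math


def flatten(matrix):
    """Map a text matrix to a point by concatenating its rows."""
    return [x for row in matrix for x in row]


def euclidean(u, v):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cluster_texts(matrices, radius):
    """Greedy clustering: join the first cluster whose leader is in range."""
    leaders, labels = [], []
    for matrix in matrices:
        point = flatten(matrix)
        for label, leader in enumerate(leaders):
            if euclidean(point, leader) <= radius:
                labels.append(label)
                break
        else:
            # No existing leader is close enough, so start a new cluster.
            leaders.append(point)
            labels.append(len(leaders) - 1)
    return labels


matrices = [[[1, 2], [0, 0]], [[1, 2], [0, 1]], [[9, 9], [9, 9]]]
print(cluster_texts(matrices, radius=1.5))
```

Texts that receive the same label would then be expanded with the same scheme.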
After the text in each initial sample has been expanded, extended samples can be generated from the texts obtained by expansion. One initial sample may yield one or more expanded texts.
In one embodiment, the sample label of the initial sample can be applied to the corresponding expanded text, and the expanded text together with that sample label forms an extended sample. Extended samples are generated efficiently in this way.
In another embodiment, a pre-trained labeling model can be used to label the texts obtained by expansion to form extended samples. The labeling result is the sample label assigned to the expanded text. The labeling model can be trained using manually labeled samples. It can be understood that, on the one hand, manually labeled samples are highly accurate; on the other hand, the sample label of an expanded text may differ from that of the text in the initial sample, for example when a risky text becomes a no-risk text through expansion. Labeling the expanded texts with a pre-trained labeling model may therefore give more accurate results.
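The two labeling options can be sketched together as below. The `labeler` argument is a hypothetical callable standing in for the pre-trained labeling model; when it is omitted, the expanded texts inherit the original label.

```python
def make_extended_samples(expanded_texts, original_label, labeler=None):
    """Pair each expanded text with an inherited or model-assigned label."""
    samples = []
    for text in expanded_texts:
        # Use the labeling model if one is supplied, else inherit the label.
        label = labeler(text) if labeler is not None else original_label
        samples.append((text, label))
    return samples


print(make_extended_samples(["data set extension"], "no risk"))
```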
Then, in step 24, the initial samples obtained in step 21 and the extended samples generated in step 23 are used together as training samples to train the natural language processing model. The model can be trained by any of the model training methods known to those skilled in the art, which are not described in detail here.
Further, in step 25, the trained natural language processing model is evaluated and, if the evaluation result satisfies a predetermined condition, the current natural language processing model is output for use in natural language processing. The model can be evaluated by at least one of the following, for example: precision, recall, AUC (Area Under Curve, the area enclosed by the ROC curve and the coordinate axes), and F1 score (the weighted harmonic mean of precision and recall). Correspondingly, the predetermined condition can be that an evaluation metric is greater than its predetermined threshold. For example, the predetermined condition can be that the AUC of the trained natural language processing model is greater than a first threshold, or that the F1 score of the natural language processing model is greater than a predetermined threshold. If the evaluation result satisfies the predetermined condition, the model is qualified; otherwise, it is unqualified. In one embodiment, the natural language processing model can be evaluated on a manually labeled test set.
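For reference, the two metrics can be computed for a binary task roughly as follows. The sketch assumes labels of 1 (positive) and 0 (negative) and computes AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as one half.

```python
def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 0, 0]
print(f1_score(y_true, [1, 0, 1, 0]))
print(auc(y_true, [0.9, 0.4, 0.5, 0.1]))
```

Either value can then be compared with its threshold to decide whether the model is qualified.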
Further, if the evaluation result satisfies the predetermined condition, the current natural language processing model can be output for use in natural language processing. If the evaluation result does not satisfy the predetermined condition, steps 23 to 25 can be repeated until it does.
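The repetition of steps 23 to 25 can be sketched as a loop over candidate expansion schemes. All callables here (`schemes`, `train`, `evaluate`) are hypothetical placeholders supplied by the caller, the expanded texts simply inherit the original labels, and the loop stops once the candidate schemes are exhausted, which is a simplification of repeating until the condition is met.

```python
def build_model(initial_samples, schemes, train, evaluate, threshold):
    """Try expansion schemes in turn until a trained model passes evaluation."""
    for scheme in schemes:
        # Step 23: expand every text; labels are inherited (assumption).
        extended = [(new_text, label)
                    for text, label in initial_samples
                    for new_text in scheme(text)]
        # Step 24: train on the initial and extended samples together.
        model = train(initial_samples + extended)
        # Step 25: output the model only if it passes the evaluation.
        if evaluate(model) > threshold:
            return model
    return None  # no scheme produced a qualified model


# Toy stand-ins so the sketch runs end to end.
samples = [("text a", "no risk"), ("text b", "risky"), ("text c", "no risk")]
no_expansion = lambda text: []
one_variant = lambda text: [text + " (variant)"]
toy_train = lambda data: len(data)    # the "model" is just the sample count
toy_evaluate = lambda model: model / 10

print(build_model(samples, [no_expansion, one_variant],
                  toy_train, toy_evaluate, threshold=0.5))
```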
To describe the above process more clearly, an implementation scenario is illustrated below with reference to Fig. 3. In this scenario, an NLP model for text risk prediction needs to be built. In the scenario shown in Fig. 3, the natural language processing method of this embodiment is applied to an NLP model generation system. The user first imports a plurality of manually labeled texts as initial samples. After the NLP model generation system obtains these initial samples, it displays a selection box in which the user selects the text expansion schemes for the initial samples, such as word vector expansion or translation expansion. The user selects a scheme by clicking the round selection box after it, and then clicks the "Confirm" button to proceed to the next step. Alternatively, the user can make no selection and click the "Auto" button at the lower right, so that the NLP model generation system adjusts the expansion scheme automatically when expanding the text. The NLP model generation system can then offer the user a choice between "manual labeling" and "automatic labeling", so that the expanded texts are either labeled manually or labeled automatically by the labeling model integrated into the system, to generate extended samples. The NLP model generation system then trains a natural language processing model using the initial samples and the extended samples, and evaluates the trained model. If the evaluation result satisfies the predetermined condition, the current natural language processing model is output. The predetermined condition can also be chosen by the user. If the evaluation result does not satisfy the predetermined condition, the NLP model generation system can prompt the user to select an expansion scheme again and repeat the subsequent process until the evaluation result satisfies the predetermined condition. In this way, after the user inputs a small number of initial samples, the whole process is executed automatically by the NLP model generation system, while the user can still make choices flexibly as needed.
To review the above process: during natural language processing, the initial samples are expanded to obtain extended samples, the initial samples and the extended samples are used together as training samples to train a natural language processing model, and whether to output the current natural language processing model is then decided from its evaluation. When the initial samples are expanded, texts whose text matrix sizes are consistent correspond to the same expansion scheme, which improves efficiency. In addition, multiple expansion schemes are combined, which increases the comprehensiveness of text expansion. In this way, the regularities present in the modeling of natural language processing models can be exploited to complete that modeling automatically, and good experience can be accumulated through the model evaluation mechanism, improving the effectiveness of natural language processing.
According to an embodiment of another aspect, a device for natural language processing is also provided. Fig. 4 shows a
schematic block diagram of a device for natural language processing according to one embodiment. As shown in Fig. 4, the
device 400 for natural language processing includes: an obtaining unit 41, configured to obtain multiple manually labeled
texts as an initial sample; a determination unit 42, configured to determine the text matrix corresponding to each text in
the initial sample; an expansion unit 43, configured to expand each text in the initial sample by at least one expansion
scheme in a pre-stored expansion scheme set, and to generate an extension sample from the texts obtained by expansion,
wherein, among the texts in the initial sample, texts whose text matrices meet a consistency condition correspond to the same
expansion scheme; a training unit 44, configured to train a natural language processing model using the initial sample and
the extension sample together as training samples; and an assessment unit 45, configured to assess the trained natural
language processing model and, if the assessment result meets a predetermined condition, to output the current natural
language processing model for use in natural language processing.
In one implementation, the multiple manually labeled texts are texts that have been preprocessed by word segmentation and
stop-word removal.
According to one embodiment, the consistency condition preset in the expansion unit 43 may include, but is not limited to, at
least one of the following:
The deviations in both the length and the width of the corresponding text matrices are within a predetermined range;
The matrix similarity is greater than a predetermined matrix similarity threshold;
The texts are clustered into the same category.
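As an illustration of the first condition only, the sketch below groups texts whose text matrix shapes deviate by no more than a
fixed amount in both dimensions, so that each group can share one expansion scheme. The greedy comparison against the first
member of each group, the single tolerance max_dev, and all function names are assumptions made for this example; the
specification does not prescribe a grouping procedure.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, int]  # (length, width) of a text matrix


def shapes_consistent(a: Shape, b: Shape, max_dev: int) -> bool:
    """True if both the length and the width deviations are within max_dev."""
    return abs(a[0] - b[0]) <= max_dev and abs(a[1] - b[1]) <= max_dev


def group_by_shape(shapes: Dict[str, Shape], max_dev: int) -> List[List[str]]:
    """Greedily group text ids whose shape is consistent with a group's first member."""
    groups: List[List[str]] = []
    for text_id, shape in shapes.items():
        for group in groups:
            if shapes_consistent(shapes[group[0]], shape, max_dev):
                group.append(text_id)
                break
        else:
            # No existing group is consistent with this text: start a new one.
            groups.append([text_id])
    return groups
```

For example, with shapes (5, 100), (6, 100), (12, 100) and (11, 100) and max_dev equal to 1, the first two texts form one
group and the last two form another.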
In one embodiment, the expansion scheme in the expansion unit 43 may include term vector expansion, where the term vector
includes at least one of the following: a term vector based on word embedding, a term vector based on strokes, and a term
vector based on affixes.
In a further embodiment, it is assumed that the multiple texts in the initial sample include a first text, and that the first
text includes a first word, where the first text may be any text in the initial sample and the first word may be any word in
the first text. In this case, the expansion unit 43 may be further configured to:
For the first word, detect whether the corpus contains a similar word whose term vector has a similarity to the term vector
of the first word greater than a predetermined word similarity threshold;
If a similar word exists, replace the first word with the similar word so as to expand the first text.
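The two steps above can be sketched as follows, assuming that term vectors are available as a plain dictionary and that
similarity is measured by cosine similarity. Both are assumptions made for illustration, as are the function names; the
specification does not name a similarity measure. Choosing only the single most similar word per position is also a
simplification.

```python
import math
from typing import Dict, List, Optional, Sequence

Vectors = Dict[str, Sequence[float]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word: str, vectors: Vectors, threshold: float) -> Optional[str]:
    """Return the corpus word most similar to `word` above `threshold`, else None."""
    if word not in vectors:
        return None
    best, best_sim = None, threshold
    for other, vec in vectors.items():
        if other == word:
            continue
        sim = cosine(vectors[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


def expand_text(tokens: List[str], vectors: Vectors, threshold: float) -> List[List[str]]:
    """Produce one expanded text for each token that has a similar word."""
    expanded = []
    for i, tok in enumerate(tokens):
        similar = most_similar(tok, vectors, threshold)
        if similar is not None:
            # Replace the word at position i and keep the rest of the text.
            expanded.append(tokens[:i] + [similar] + tokens[i + 1:])
    return expanded
```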
In another embodiment, the expansion scheme in the expansion unit 43 may also include multilingual translation expansion.
Assuming that the initial sample includes a first text, and that the first text is written in a first language (such as
Chinese), the expansion unit 43 may be further configured to:
Translate the first text, by means of a language conversion model, into a second text written in a second language (such as
English);
Translate the second text, by means of the language conversion model, into a third text written in the first language;
Determine the text obtained by expansion according to the third text.
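A minimal sketch of this round-trip translation follows. The language conversion model is represented by a hypothetical
callable translate(text, source, target) supplied by the caller; no real translation API is implied. Keeping the third text
only when it differs from the first text is one assumed way of determining the expanded text from the third text.

```python
from typing import Callable, List

# Hypothetical signature: (text, source_language, target_language) -> translated text
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, src: str, pivots: List[str]) -> List[str]:
    """Round-trip `text` through each pivot language and keep new, distinct results."""
    results: List[str] = []
    for pivot in pivots:
        second = translate(text, src, pivot)   # first language -> second language
        third = translate(second, pivot, src)  # second language -> first language
        # Assumed rule: keep the third text only if it adds a new wording.
        if third != text and third not in results:
            results.append(third)
    return results
```

Passing several pivot languages yields several candidate paraphrases of the same first text, which is where the multilingual
aspect of the scheme comes in.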
In one possible design, the expansion unit 43 may be further configured to:
Label the texts obtained by expansion using a pre-trained labeling model, and form the extension sample from the labeling
results together with the corresponding texts obtained by expansion.
In one embodiment, the predetermined condition in the assessment unit 45 may include, but is not limited to, at least one of
the following:
The area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
The balanced F-score of the trained natural language processing model is greater than a second threshold, where the balanced
F-score is the weighted harmonic mean of precision and recall.
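The two criteria can be computed as in the sketch below. It assumes binary labels (0 and 1), computes AUC as the fraction of
positive and negative pairs ranked correctly, and derives predictions from scores with an assumed cutoff of 0.5. The function
names and the choice to require both criteria at once are assumptions for illustration; the embodiment only requires at least
one of them.

```python
from typing import Sequence


def f_beta(y_true: Sequence[int], y_pred: Sequence[int], beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (beta = 1 gives the balanced F-score)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_condition(y_true, scores, first_threshold, second_threshold, cutoff=0.5) -> bool:
    """Assumed combination: both the AUC and the balanced F-score must exceed their thresholds."""
    y_pred = [1 if s >= cutoff else 0 for s in scores]
    return auc(y_true, scores) > first_threshold and f_beta(y_true, y_pred) > second_threshold
```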
It is worth noting that the device 400 shown in Fig. 4 is a device embodiment corresponding to the method embodiment shown in
Fig. 2; the corresponding descriptions in the method embodiment shown in Fig. 2 apply equally to the device 400 and are not
repeated here.
With the above device, multiple expansion schemes can be selected to expand the initial sample, and texts whose text matrix
sizes meet the consistency condition correspond to the same expansion scheme. Whether an expansion scheme is suitable is
determined by assessing the trained model. In this way, the effectiveness of natural language processing can be improved.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program
is stored; when the computer program is executed in a computer, the computer is caused to perform the method described with
reference to Fig. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor,
where executable code is stored in the memory, and the processor, when executing the executable code, implements the method
described with reference to Fig. 2.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present
invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these
functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a
computer-readable medium.
The specific embodiments described above further explain the objectives, technical solutions, and beneficial effects of the
present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present
invention and are not intended to limit its protection scope; any modification, equivalent replacement, improvement, and the
like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the
present invention.