CN110008335A - The method and device of natural language processing - Google Patents

The method and device of natural language processing

Publication number
CN110008335A
Authority
CN
China
Prior art keywords
text
extension
natural language
sample
language processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811519721.8A
Other languages
Chinese (zh)
Inventor
袁锦程
王维强
许辽萨
赵闻飙
叶芸
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201811519721.8A
Publication of CN110008335A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The embodiments of this specification provide a computer-executed method and apparatus for natural language processing. In the method, initial samples are extended on the basis of their text matrices to obtain extended samples, the initial samples and the extended samples are used together as training samples to train a natural language processing model, and whether the current model is output is then decided according to an assessment of that model. The initial samples are extended by one or more of the optional schemes in an expansion scheme set. The schemes can be combined with one another, which makes the extension more comprehensive, and texts whose text matrix sizes satisfy a consistency condition are assigned the same expansion scheme, which makes the extension more efficient. In short, the embodiments can improve the effectiveness of natural language processing.

Description

The method and device of natural language processing
Technical field
One or more embodiments of this specification relate to the field of computer technology, and in particular to a method and apparatus for natural language processing performed by a computer.
Background
With the development of artificial intelligence, natural language processing is applied ever more widely. For example, in risk prevention and control for content published on the internet, scenarios such as public opinion monitoring, anti-fraud, cross-border sales restrictions, anti-money laundering, and spam text detection require natural language processing to identify the risk of a text. The natural language model that is needed may also differ from scenario to scenario. In conventional techniques, a new natural language processing (NLP) model is often built from scratch for each scenario under the corresponding language framework. This way of modeling repeats the same steps every time, for example rebuilding the corpus and manually expanding the data set.
Summary of the invention
One or more embodiments of this specification describe a computer-executed natural language processing method and apparatus that exploit the regularities found in building natural language processing models to complete the modeling automatically, and that accumulate good experience through a model evaluation mechanism, thereby improving the effectiveness of natural language processing.
According to a first aspect, a computer-executed method of natural language processing is provided, the method comprising: obtaining a plurality of manually labeled texts as initial samples; determining the text matrix corresponding to each text in the initial samples; extending each text in the initial samples by at least one expansion scheme in a pre-stored expansion scheme set, and generating extended samples from the texts obtained by the extension, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to the same expansion scheme; training a natural language processing model with the initial samples and the extended samples together as training samples; and assessing the trained natural language processing model and, if the assessment result satisfies a predetermined condition, outputting the current natural language processing model for use in natural language processing.
In one embodiment, the manually labeled texts are texts that have been preprocessed by word segmentation and stop-word removal.
In one embodiment, the consistency condition includes at least one of the following:
The deviations of both the length and the width of the corresponding text matrices are within a predetermined range;
The matrix similarity is greater than a predetermined matrix similarity threshold;
The texts are clustered into the same category.
In one embodiment, the expansion scheme includes word vector extension, and the word vector includes at least one of the following: a word vector based on word embedding, a word vector based on strokes, a word vector based on affixes.
In one embodiment, the texts in the initial samples include a first text, and the first text includes a first word; and
Extending each text in the initial samples by at least one expansion scheme in the expansion scheme set comprises:
For the first word, detecting whether the corpus contains a similar word whose word-vector similarity to the first word is greater than a predetermined word similarity threshold;
If such a similar word exists, replacing the first word with the similar word so as to extend the first text.
In one embodiment, the expansion scheme includes multilingual translation extension, the texts in the initial samples include a first text, and the first text is written in a first language;
Extending each text in the initial samples by at least one expansion scheme in the expansion scheme set comprises:
Translating the first text, by means of a language conversion model, into a second text written in a second language;
Translating the second text, by means of a language conversion model, into a third text written in the first language;
Determining the text obtained by the extension according to the third text.
In one embodiment, generating extended samples from the texts obtained by the extension comprises:
Labeling the texts obtained by the extension with a pre-trained labeling model, and forming each extended sample from a labeling result together with the corresponding extended text.
In one embodiment, the predetermined condition includes at least one of the following:
The area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
The balanced F score of the trained natural language processing model is greater than a second threshold, where the balanced F score is the harmonic mean of precision and recall.
According to a second aspect, an apparatus for natural language processing is provided, comprising:
An acquiring unit, configured to obtain a plurality of manually labeled texts as initial samples;
A determination unit, configured to determine the text matrix corresponding to each text in the initial samples;
An extension unit, configured to extend each text in the initial samples by at least one expansion scheme in a pre-stored expansion scheme set, and to generate extended samples from the texts obtained by the extension, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to the same expansion scheme;
A training unit, configured to train a natural language processing model with the initial samples and the extended samples together as training samples;
An assessment unit, configured to assess the trained natural language processing model and, if the assessment result satisfies a predetermined condition, to output the current natural language processing model for use in natural language processing.
According to a third aspect, a computer-readable storage medium is provided on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to execute the method of the first aspect.
According to a fourth aspect, a computing device is provided, comprising a memory and a processor, wherein the memory stores executable code and the processor, when executing the executable code, implements the method of the first aspect.
In the computer-executed method and apparatus for natural language processing provided by the embodiments of this specification, initial samples are extended according to their text matrices to obtain extended samples, the initial samples and the extended samples are used together as training samples to train a natural language processing model, and whether the current model is output is then decided according to an assessment of that model. The initial samples are extended by one or more of the optional schemes in the expansion scheme set, the schemes can be combined with one another, and texts whose text matrix sizes satisfy the consistency condition correspond to the same expansion scheme. This improves the efficiency of the extension, makes the extension more comprehensive, and allows the result of the extension to be assessed. In this way the effectiveness of natural language processing can be improved.
Detailed description of the invention
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an implementation architecture of an embodiment of this specification;
Fig. 2 is a flow chart of a computer-executed method of natural language processing according to one embodiment;
Fig. 3 is a schematic diagram of an implementation scenario of one embodiment disclosed in this specification;
Fig. 4 is a schematic block diagram of an apparatus for natural language processing according to one embodiment.
Detailed description
The solutions provided in this specification are described below with reference to the drawings.
Fig. 1 is a schematic diagram of an implementation architecture of an embodiment of this specification. In this architecture, the natural language processing method is executed by the computing platform in Fig. 1. The computing platform can be deployed on any device (computer) or device cluster with computing capability. The computing platform first obtains a plurality of manually labeled texts as initial samples. Each initial sample corresponds to one text together with its manually assigned sample label. For each text, the computing platform also converts the text into a text matrix. The computing platform can then extend each text in the initial samples. It can store expansion schemes of several expansion types in advance. When extending texts, the computing platform extends texts whose text matrices satisfy a consistency condition (for example, the deviation in matrix size is within a predetermined range) with an expansion scheme of the same expansion type. The computing platform can then generate extended samples from the texts obtained by the extension. In one embodiment, an extended text is given the same sample label as the text it was extended from, forming an extended sample. In another embodiment, a pre-trained labeling model labels the extended text to determine its sample label, thereby generating an extended sample.
Further, the computing platform can train the NLP model with the initial samples and the extended samples together as training samples, assess the currently trained NLP model, and decide according to the assessment result whether to output it. In one embodiment, if the assessment result is qualified, the current model is output. In another embodiment, if the assessment result is unqualified, the computing platform can reselect the expansion schemes for the initial samples and generate extended samples again, until the trained NLP model passes the assessment. In this way the training of the NLP model can be completed automatically, which improves the effectiveness of NLP model training. The process of natural language processing by computer is described in detail below.
Fig. 2 shows a flow chart of a computer-executed method of natural language processing according to one embodiment. The method can be executed by any system, unit, platform, or server with computing and processing capability, such as the computing platform shown in Fig. 1.
As shown in Fig. 2, the method includes the following steps: step 21, obtaining a plurality of manually labeled texts as initial samples; step 22, determining the text matrix corresponding to each text in the initial samples; step 23, extending each text in the initial samples by an expansion scheme of at least one expansion type in a pre-stored expansion scheme set, and generating extended samples from the texts obtained by the extension, wherein, among the texts in the initial samples, texts whose text matrices satisfy a consistency condition correspond to an expansion scheme of the same expansion type; step 24, training a natural language processing model with the initial samples and the extended samples together as training samples; step 25, assessing the trained natural language processing model and, if the assessment result satisfies a predetermined condition, outputting the current natural language processing model for use in natural language processing.
First, in step 21, a plurality of manually labeled texts are obtained as initial samples. A text here can be an article, a single sentence, and so on. Each initial sample includes one text and one manually assigned label. The content of the label is determined by the final usage scenario and purpose of the NLP model. For example, when the NLP model is used to predict text risk, the label can be "no risk" or "risky". An initial sample might then consist of the text "I had breakfast downstairs" and the manually assigned label "no risk".
In some embodiments, the texts in the initial samples can also be preprocessed texts. The texts can be segmented and have their stop words removed in advance. Word segmentation divides the characters of a text into individual words. For example, the text "I had breakfast downstairs" can be divided, using a pre-trained dictionary, into words such as "I", "downstairs", "eat", "a", "breakfast", and so on. The stop words are then removed, leaving words such as "downstairs, eat, breakfast, ...".
Because only the effective words of a text remain after segmentation and stop-word removal, using only these words in subsequent processing greatly reduces the amount of data to be processed. In some embodiments, therefore, only the segmented texts with stop words removed are processed further. For convenience, this specification still refers to the words obtained by preprocessing a text as the text in the initial sample. An initial sample may then include the text "get up, eat, go shopping, eat, rest, eat ..." and the sample label "no risk", and so on.
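The preprocessing described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes forward maximum matching as the dictionary-based segmentation technique, and the sentence, dictionary, and stop-word list are illustrative assumptions chosen to resemble the breakfast example.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


# Assumed illustrative data: a sentence meaning "I had a breakfast downstairs".
dictionary = {"楼下", "一个", "早餐"}       # downstairs, a, breakfast
stop_words = {"我", "在", "了", "一个"}     # I, at, (aspect particle), a
tokens = segment("我在楼下吃了一个早餐", dictionary)
effective = remove_stop_words(tokens, stop_words)
```

In practice a trained segmenter and a curated stop-word list would replace the toy dictionary and set used here.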
Then, in step 22, the text matrix corresponding to each text in the initial samples is determined. Words and sentences can be represented by vectors. A text contains at least one sentence, so if each sentence is represented as a vector, the whole text can be represented as a matrix composed of these vectors.
As an example, each column of the text matrix can correspond to one word and hold the weight of that word in each sentence, the weight being positively correlated with the number of times the word occurs in the sentence. Take the following corpus as an example:
Xiao Ming likes maple leaves, and Xiao Hong also likes maple leaves.
Xiao Ming likes playing football.
The words appearing in the corpus are ordered as {Xiao Ming: 1, like: 2, maple leaf: 3, Xiao Hong: 4, kick: 5, football: 6}. Counting the occurrences of each word, the vector of "Xiao Ming likes maple leaves, and Xiao Hong also likes maple leaves" is {1, 2, 2, 1, 0, 0}: each dimension gives the count of one word, so "Xiao Ming" occurs once, "like" occurs twice, and so on. By the same counting, "Xiao Ming likes playing football" yields {1, 1, 0, 0, 1, 1}, and the matrix corresponding to the corpus can be:
[1 2 2 1 0 0]
[1 1 0 0 1 1]
This is a 6*2 matrix; that is, the length of the matrix is 6 and its width is 2.
In this way, each text in the initial samples can be converted into a similar text matrix.
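The construction of the text matrix can be sketched as below. It is a minimal example that uses raw occurrence counts as weights, one row per sentence and one column per word. The function name `text_matrix` and the pre-segmented English tokens are assumptions made for illustration.

```python
from collections import Counter


def text_matrix(sentences, vocabulary):
    """Return one row per sentence; each column counts one vocabulary word."""
    rows = []
    for tokens in sentences:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return rows


vocabulary = ["Xiao Ming", "like", "maple leaf", "Xiao Hong", "kick", "football"]
sentences = [
    ["Xiao Ming", "like", "maple leaf", "Xiao Hong", "like", "maple leaf"],
    ["Xiao Ming", "like", "kick", "football"],
]
matrix = text_matrix(sentences, vocabulary)
length, width = len(vocabulary), len(sentences)  # 6 words, 2 sentences
```

Any weight that grows with the occurrence count (for example TF-IDF) could stand in for the raw counts used here.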
Then, in step 23, each text in the initial samples is extended by at least one expansion scheme in the pre-stored expansion scheme set, and extended samples are generated from the texts obtained by the extension. Note that in this step, among the texts in the initial samples, texts whose text matrices satisfy the consistency condition are extended with the same expansion scheme. The expansion scheme set can be configured in advance and can include one or more expansion schemes.
According to one implementation, the expansion schemes in the expansion scheme set can include word vector extension. For each word in a text, similar words are determined by similarity, and the text is extended with these similar words to obtain the extended text.
In one embodiment, the word vectors can include word vectors based on word embedding. Word embedding maps the words or phrases of a text to vectors of real numbers. Using a word co-occurrence matrix, word embedding places words in the same vector space according to their context, computes weights, and builds the co-occurrence matrix, from which the word vector of each word is determined.
As an example, suppose there is the following corpus:
Xiao Ming likes maple leaves, Xiao Hong likes maple leaves.
Xiao Ming likes playing football.
The words in the corpus are first collected and sorted. Many ordering rules are possible; for example, the words can be ordered by first appearance. The words can be those that remain after the corpus texts are segmented and stop words are removed. Alternatively, the term frequency-inverse document frequency (TF-IDF) of the words in the corpus can be computed and a predetermined number (such as 20,000) of the words with the highest TF-IDF values selected.
Suppose the ordering of the words is {Xiao Ming: 1, like: 2, maple leaf: 3, Xiao Hong: 4, kick: 5, football: 6}. The 6 words then correspond to 6 dimensions of the corpus.
A simple vector representation of the words may be:
Xiao Ming: [1,0,0,0,0,0]
Like: [0,1,0,0,0,0]
……
Further, a text is represented as a vector as follows:
"Xiao Ming likes maple leaves. Xiao Hong likes maple leaves.": [1,2,2,1,0,0] ...
Each dimension is the frequency with which the corresponding word occurs in the text. However, this representation usually does not take the associations between words into account.
To capture the associations between words, a corpus document is usually represented by a co-occurrence matrix. As shown in Table 1 below, the entry at the intersection of a row and a column is the number of times the two words occur together in the corpus above. For example, "Xiao Ming" and "like" occur together 2 times.
Table 1. Number of times each pair of words occurs together in the corpus above [table not reproduced in this text]
From Table 1, the co-occurrence matrix of the corpus above can be determined [matrix not reproduced in this text].
Each row (or column) of the co-occurrence matrix is the word vector of the corresponding word. For example, the word vector of "like" is [2,0,2,1,1,0].
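A co-occurrence matrix of this kind can be sketched as follows. The counting rule is an assumption: two words are taken to co-occur when they are adjacent within the same clause (a window of 1). That rule reproduces the word vector [2,0,2,1,1,0] given for "like"; the rule actually used for Table 1 may differ.

```python
def cooccurrence_matrix(clauses, vocabulary, window=1):
    """Count, for every pair of words, how often they appear within `window` positions."""
    index = {word: i for i, word in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for clause in clauses:
        for i, word in enumerate(clause):
            for neighbor in clause[i + 1:i + 1 + window]:
                a, b = index[word], index[neighbor]
                matrix[a][b] += 1
                if a != b:  # keep the matrix symmetric without double-counting the diagonal
                    matrix[b][a] += 1
    return matrix


vocabulary = ["Xiao Ming", "like", "maple leaf", "Xiao Hong", "kick", "football"]
clauses = [
    ["Xiao Ming", "like", "maple leaf"],
    ["Xiao Hong", "like", "maple leaf"],
    ["Xiao Ming", "like", "kick", "football"],
]
cooc = cooccurrence_matrix(clauses, vocabulary)
like_vector = cooc[vocabulary.index("like")]
```

A wider window, or counting across clause boundaries, would change the individual entries while leaving the overall construction the same.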
In another embodiment, the word vectors can include word vectors based on strokes. A word or phrase can be decomposed by stroke and mapped to a vector of real numbers; for example, the word vector of each word can be determined from the strokes used to write the word or phrase. Specifically, a word written with strokes (such as a Chinese word) is split into stroke n-grams built from strokes such as "horizontal", "vertical", "left-falling", and "right-falling". The stroke-based word vector is then determined from the split result and its association with the context. Stroke-based word vectors can be determined by algorithms such as cw2vec, which are not described further here. Optionally, the strokes can be replaced by radicals (combinations of several strokes, such as "Xin") to determine the word vector.
In yet another embodiment, the word vectors can include word vectors based on affixes. Affix-based word vectors are similar to word vectors based on strokes (or radicals), with affixes (such as "ing") serving as the unit of splitting; they are not described further here.
In further embodiments, word vectors determined by other methods can also be included, for example word vectors determined by supervised machine learning methods such as word2vec, CNN (convolutional neural network), and LSTM (long short-term memory), or word vectors determined by unsupervised machine learning methods such as k-means; no limitation is imposed here. In an optional embodiment, the expansion scheme can include at least two of the above kinds of word vectors, and the word vectors produced by the different methods can be combined into new vectors, for example by concatenating a word vector determined by word2vec with a word vector based on word embedding. Concatenating vectors of different forms in this way can increase the feature dimensionality of the NLP model.
Further, each text in the initial samples can also be extended at the level of words.
On the one hand, for a given word of a given text in the initial samples, word similarity can be determined by computing vector similarity. For convenience of description, the text is called the first text and the word is called the first word. The similarity of word vectors can be computed by methods such as cosine similarity or the Jaccard coefficient. Taking the Jaccard coefficient as an example, suppose the vector of word A is [1,0,0,1] and the vector of word B is [0,0,1,1]. Both vectors are four-dimensional, and the dimensions with identical values are the second and the fourth, so the similarity of word A and word B can be: identical dimensions / total dimensions = 2/(4+4). Words whose similarity to the original word (such as the first word) is greater than a predetermined word similarity threshold can then be taken as similar words of that word, or a predetermined number of words with the highest similarity to the original word can be taken as its similar words; no limitation is imposed here. In this way, extended texts can be formed by replacing words in the texts of the initial samples with similar words.
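The similarity-based replacement can be sketched as follows. `position_match_ratio` implements the ratio exactly as the example above computes it (matching positions divided by the sum of the two vector lengths), which differs from the textbook Jaccard coefficient on sets. `cosine_similarity` is the standard formula. The toy word vectors, the threshold, and the function names are assumptions for illustration.

```python
import math


def position_match_ratio(a, b):
    """Matching positions divided by total dimensions, as in the 2/(4+4) example."""
    same = sum(1 for x, y in zip(a, b) if x == y)
    return same / (len(a) + len(b))


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_by_similar_words(tokens, vectors, threshold):
    """Return new token lists, each with one word replaced by a similar word."""
    expanded = []
    for pos, word in enumerate(tokens):
        if word not in vectors:
            continue
        for candidate, vec in vectors.items():
            if candidate != word and cosine_similarity(vectors[word], vec) > threshold:
                expanded.append(tokens[:pos] + [candidate] + tokens[pos + 1:])
    return expanded


ratio = position_match_ratio([1, 0, 0, 1], [0, 0, 1, 1])

# Hypothetical word vectors; real ones would come from the corpus.
vectors = {
    "breakfast": [1.0, 0.9, 0.0],
    "brunch": [0.9, 1.0, 0.1],
    "football": [0.0, 0.1, 1.0],
}
new_texts = expand_by_similar_words(["downstairs", "eat", "breakfast"], vectors, 0.9)
```

Taking the top-k most similar words instead of applying a threshold, as the text also allows, would only change the selection step inside the inner loop.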
Note that in the embodiments of this specification, to ensure that text extension is both effective and diverse, as many types of corpora as possible can be collected to enlarge the base corpus. For example, colloquial corpora can be collected from social tools such as microblogs, and professional corpora can be collected from sources such as Wikipedia.
On the other hand, the expansion schemes can also include multilingual translation extension. Call any text in the initial samples the first text, and suppose the first text is written in a first language (such as Chinese). The first text can be translated by a language conversion model into a second text written in a second language, the second text can then be translated into a third text written in the first language, and the text obtained by the extension is determined from the third text. As an example, the Chinese first text "data set enlargement" can first be translated into the English second text "Data set expansion", and the second text "Data set expansion" can then be translated into the Chinese third text "data set extension". The third text "data set extension" can then be used as the text obtained by the extension. The second language can be any language other than the first language. Moreover, when one language is translated into another, the translation is not necessarily unique. That is, in the example above the first text can also be translated into at least one text in German, French, and so on; German, French, and so on then serve as the second language, and the second text is any one of the texts obtained in the second language. Likewise, when the second text in the second language is translated back into a text in the first language (the third text), the result need not be unique. In this way, varied translation can extend a text into a range of semantically similar expressions.
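The round trip through a second language can be sketched as below. The `translate` callable is hypothetical: it stands in for whatever language conversion model is available and is assumed to return a list of candidate translations. The lookup table is a toy stub in which English glosses stand in for the Chinese strings.

```python
def back_translate(text, translate, source="zh", pivots=("en",)):
    """Translate to each pivot language and back; collect results that differ from the input."""
    results = set()
    for pivot in pivots:
        for second_text in translate(text, source, pivot):
            for third_text in translate(second_text, pivot, source):
                if third_text != text:
                    results.add(third_text)
    return sorted(results)


# Toy stand-in for a language conversion model (assumed data).
_TABLE = {
    ("data set enlargement", "zh", "en"): ["Data set expansion"],
    ("Data set expansion", "en", "zh"): ["data set extension", "data set enlargement"],
}


def toy_translate(text, src, dst):
    """Return the candidate translations of `text` from `src` to `dst`."""
    return _TABLE.get((text, src, dst), [])


extended_texts = back_translate("data set enlargement", toy_translate)
```

Passing several pivot languages, for example `pivots=("en", "de", "fr")`, would collect the round-trip results of each, matching the observation that translations are not unique.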
Each text in the initial samples corresponds to one text matrix. If the text matrices of two texts are consistent, the two matrices may be similar in size, or the two texts may be close in meaning. Such texts can then be extended with the same expansion scheme. A consistency condition for text matrices can therefore be set in advance and used to check the consistency between the text matrices of different texts.
In one embodiment, the consistency condition can be that the deviations of the text matrices are all within a predetermined range. That is, the lengths and widths are either identical or both close. If the text matrices of two texts have the same length and the same width, the two texts at least contain the same number of words (or keywords) and the same number of sentences. Two matrices are close in both length and width when the deviations in length and in width are both within a predetermined range (such as less than 5%). For example, if one text matrix is 100*100 and another is 98*98, the length deviation and the width deviation of the two matrices can both be (100-98)/100, or (100-98)/98. If the predetermined range is less than 5%, the two matrices can be considered consistent.
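The size-based consistency check, and the reuse of one expansion scheme for consistent texts, can be sketched as follows. Dividing by the larger dimension is one of the two options mentioned above. The greedy grouping and the round-robin choice of scheme for each new group are assumptions made for illustration, not part of the described method.

```python
def relative_deviation(a, b):
    """Deviation of two sizes relative to the larger one, e.g. (100 - 98) / 100."""
    largest = max(a, b)
    return abs(a - b) / largest if largest else 0.0


def sizes_consistent(shape_a, shape_b, tolerance=0.05):
    """True when both the length and the width deviate by less than `tolerance`."""
    return all(relative_deviation(x, y) < tolerance for x, y in zip(shape_a, shape_b))


def assign_schemes(shapes, schemes, tolerance=0.05):
    """Give every text the scheme of the first earlier text with a consistent matrix size."""
    representatives = []  # (shape, scheme) of the first text in each group
    assignment = []
    for shape in shapes:
        for rep_shape, scheme in representatives:
            if sizes_consistent(shape, rep_shape, tolerance):
                assignment.append(scheme)
                break
        else:
            # New group: pick the next scheme in round-robin order (assumed policy).
            scheme = schemes[len(representatives) % len(schemes)]
            representatives.append((shape, scheme))
            assignment.append(scheme)
    return assignment


chosen = assign_schemes(
    [(100, 100), (98, 98), (20, 5)],
    ["word_vector", "back_translation"],
)
```

Here the 100*100 and 98*98 matrices fall into one group and share a scheme, while the 20*5 matrix starts a new group.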
In another embodiment, the consistency condition can be that the matrix similarity is greater than a predetermined matrix similarity threshold. The matrix similarity can be determined by any matrix similarity method. For example, the two matrices can be flattened into two vectors, and the similarity of the two vectors, determined by a method such as the Jaccard coefficient described above, is taken as the similarity of the two matrices. As another example, the elements at corresponding positions of the two matrices can be subtracted, and the ratio of the number of elements of the resulting matrix that are smaller than a predetermined value (such as 1) to the total number of elements of the matrix is taken as the similarity of the two matrices. If the similarity of two text matrices exceeds the predetermined matrix similarity threshold (such as 90%), the two matrices are similar. Each element of a text matrix may represent the meaning expressed by the text, the words used, and so on. If the text matrices are determined on the basis of the same corpus, elements at the same position of the text matrices may have the same meaning, so the closer the element values, the more similar the semantics of the two texts. Texts with more similar semantics can be extended with the same expansion scheme, which makes it easy to reuse known extension experience.
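The element-difference similarity can be sketched as follows. It assumes the two matrices have the same shape and compares absolute differences against the predetermined value. The function name and sample matrices are illustrative.

```python
def difference_similarity(a, b, max_diff=1):
    """Share of positions whose absolute element difference is below `max_diff`."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    total = sum(len(row) for row in a)
    close = sum(
        1
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
        if abs(x - y) < max_diff
    )
    return close / total


matrix_a = [[1, 2, 2, 1, 0, 0], [1, 1, 0, 0, 1, 1]]
matrix_b = [[1, 2, 2, 1, 0, 0], [1, 1, 0, 1, 1, 1]]  # one element differs by 1
similarity = difference_similarity(matrix_a, matrix_b)
is_similar = similarity > 0.9  # predetermined matrix similarity threshold
```

With integer counts and a predetermined value of 1, only identical elements count as close, so 11 of the 12 positions match in this example.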
In yet another embodiment, the consistency condition can also be that the texts are clustered into the same category. Specifically, the texts in the initial samples can be clustered according to the values of the elements of their text matrices. For example, the text matrix can be regarded as a multidimensional space, each text matrix can be mapped to a point in that space according to its element values, and the texts in the initial samples can then be clustered by a clustering method based on, for example, Euclidean distance. Texts clustered into the same category are consistent texts. Optionally, the texts currently to be extended can also be clustered together with texts whose past extension worked well, so that the extension experience that worked well in the past is reused for text extension, which is more efficient.
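Clustering on Euclidean distance can be sketched as follows. The example uses a simple greedy "leader" clustering as a stand-in for whichever clustering method is actually chosen, and it assumes all text matrices have the same shape so that they flatten to points of equal dimension. The radius and the sample matrices are illustrative.

```python
import math


def flatten(matrix):
    """Map a text matrix to a point by listing its elements row by row."""
    return [value for row in matrix for value in row]


def euclidean(p, q):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def leader_cluster(points, radius):
    """Assign each point to the first cluster leader within `radius`, else start a new cluster."""
    leaders, labels = [], []
    for point in points:
        for idx, leader in enumerate(leaders):
            if euclidean(point, leader) <= radius:
                labels.append(idx)
                break
        else:
            leaders.append(point)
            labels.append(len(leaders) - 1)
    return labels


points = [
    flatten([[1, 1], [2, 1]]),
    flatten([[1, 1], [2, 2]]),
    flatten([[9, 9], [8, 9]]),
]
labels = leader_cluster(points, radius=2.0)
```

Texts that receive the same label would then be treated as consistent and extended with the same expansion scheme.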
After the texts in the initial samples have been extended, extended samples can be generated from the texts obtained by the extension. One initial sample can yield one or more extended texts.
In one embodiment, each extended text can be labeled with the sample label of the initial sample it was extended from, so that the extended text and this sample label form an extended sample. Extended samples are generated efficiently in this way.
In another embodiment, a pre-trained labeling model can label the texts obtained by the extension to form extended samples. The labeling result is the sample label assigned to the extended text. The labeling model can be trained on manually labeled samples. On the one hand, manually labeled samples are highly accurate. On the other hand, the sample label of an extended text may differ from that of the text in the initial sample; for example, a risky text may become a risk-free text through extension. Labeling the extended texts with a pre-trained labeling model may therefore give more accurate results.
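The two labeling options can be sketched together. The function `make_extended_samples` and the keyword-based `toy_label_model` are hypothetical; a real labeling model would be a classifier trained on manually labeled samples.

```python
def make_extended_samples(extended_texts, source_label, label_model=None):
    """Pair each extended text with a label: inherited, or predicted by a labeling model."""
    samples = []
    for text in extended_texts:
        label = label_model(text) if label_model is not None else source_label
        samples.append((text, label))
    return samples


def toy_label_model(text):
    """Toy stand-in for a pre-trained labeling model (assumed rule)."""
    return "risky" if "transfer money" in text else "no risk"


texts = ["please transfer money today", "please send the report today"]
inherited = make_extended_samples(texts, "risky")
relabeled = make_extended_samples(texts, "risky", toy_label_model)
```

The second call shows the case described above, where an extended text no longer carries the label of the initial sample.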
Then, in step 24, the initial sample obtained in step 21 and the extension samples generated in step 23 are used together as training samples to train the natural language processing model. The model training may be carried out by any of the various model training methods available to those skilled in the art, which are not described here.
Further, in step 25, the trained natural language processing model is assessed and, if the assessment result meets a predetermined condition, the current natural language processing model is output for use in natural language processing. The assessment of the natural language processing model may be based, for example, on at least one of the following: accuracy, recall, AUC (Area Under Curve, the area enclosed by the ROC curve and the coordinate axes), and F1 score (a weighted average of precision and recall, conventionally their harmonic mean). Correspondingly, the predetermined condition may be that the evaluation index is greater than a corresponding predetermined threshold. For example, the predetermined condition may be that the AUC of the trained natural language processing model is greater than a first threshold, or that the F1 score of the natural language processing model is greater than a predetermined threshold. If the assessment result meets the predetermined condition, the model is qualified; otherwise it is unqualified. In one embodiment, the natural language processing model may be assessed on a manually labeled test set.
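The evaluation indices mentioned in step 25 can be computed as in the following sketch for a binary task (label 1 for the positive class). The function names and the threshold defaults of 0.8 are assumptions chosen for illustration; the AUC is computed through its rank interpretation, that is, the probability that a positive example scores higher than a negative one.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 (their harmonic mean) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def meets_condition(auc_value: float, f1_value: float,
                    auc_threshold: float = 0.8, f1_threshold: float = 0.8) -> bool:
    """Hypothetical predetermined condition: either index exceeds its threshold."""
    return auc_value > auc_threshold or f1_value > f1_threshold
```

Whether the condition combines the indices with "or" or requires all of them is a design choice; the text only requires that at least one such criterion is defined.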
Further, if the assessment result meets the predetermined condition, the current natural language processing model may be output for use in natural language processing. If the assessment result does not meet the predetermined condition, steps 23-25 may be repeated until it does.
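The repetition of steps 23-25 can be pictured as a loop over candidate expansion schemes. The sketch below is an assumption-laden outline: `extend`, `train`, and `evaluate` are hypothetical callables standing in for steps 23, 24, and 25, and trying one scheme per round is only one possible way of varying the extension between rounds.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Sample = Tuple[str, str]  # (text, label); layout assumed for illustration


def train_until_qualified(
    initial: Sequence[Sample],
    schemes: Sequence[str],
    extend: Callable[[Sequence[Sample], str], List[Sample]],
    train: Callable[[List[Sample]], Any],
    evaluate: Callable[[Any], float],
    threshold: float,
) -> Optional[Tuple[Any, str]]:
    """Repeat extension, training and assessment until the model qualifies.

    Returns the qualified model and the scheme that produced it, or None
    if no candidate scheme leads to a qualified model.
    """
    for scheme in schemes:
        extended = extend(initial, scheme)            # step 23
        model = train(list(initial) + extended)       # step 24
        if evaluate(model) > threshold:               # step 25
            return model, scheme
    return None
```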
To describe the above process more clearly, an implementation scenario is described below with reference to Fig. 3. In this scenario, an NLP model for text risk prediction needs to be established. In the scenario shown in Fig. 3, the natural language processing method of this embodiment may be applied to an NLP model generation system. First, the user imports multiple manually labeled texts as the initial sample. After obtaining the initial sample, the NLP model generation system displays a selection box in which the user selects the text expansion schemes for the initial sample, such as word vector extension, translation extension, and so on. The user makes a selection by clicking the round check box after each expansion scheme, and then clicks the "Confirm" button to proceed to the next step. Alternatively, the user may make no selection and click the "Auto" button at the lower right, so that the NLP model generation system adjusts the expansion scheme automatically to perform text extension. The NLP model generation system may then also offer the user a choice between "manual labeling" and "automatic labeling", so that the extended texts are either labeled manually or labeled automatically by the labeling model integrated in the NLP model generation system, to generate the extension samples.
Then, the NLP model generation system trains the natural language processing model with the initial sample and the extension samples, and assesses the trained model. If the assessment result meets the predetermined condition, the current natural language processing model is output. The predetermined condition may also be set by the user. If the assessment result does not meet the predetermined condition, the NLP model generation system may also prompt the user to select an expansion scheme again and repeat the subsequent process until the assessment result meets the predetermined condition. In this way, after the user inputs a small initial sample, the whole process is executed automatically by the NLP model generation system, and the user can make choices flexibly according to need.
Reviewing the above process: during natural language processing, the initial sample is extended to obtain extension samples, the initial sample and the extension samples are used together as training samples to train the natural language processing model, and whether to output the current natural language processing model is then decided according to an assessment of it. When the initial sample is extended, texts in the initial sample whose text matrices are consistent in size correspond to the same expansion scheme, which improves efficiency. In addition, multiple expansion schemes are combined, which makes text extension more comprehensive. In this way, the regularities present in natural language processing modeling can be used to complete the modeling of the natural language processing model automatically, and good experience is accumulated through the model assessment mechanism, improving the effectiveness of natural language processing.
According to an embodiment of another aspect, a device for natural language processing is also provided. Fig. 4 shows a schematic block diagram of a device for natural language processing according to one embodiment. As shown in Fig. 4, the device 400 for natural language processing includes: an acquiring unit 41, configured to obtain multiple manually labeled texts as an initial sample; a determination unit 42, configured to determine the text matrix corresponding to each text in the initial sample; an expanding unit 43, configured to extend each text in the initial sample by at least one expansion scheme in a pre-stored expansion scheme set, and to generate extension samples from the texts obtained by extension, wherein, among the texts in the initial sample, texts whose text matrices meet a consistency condition correspond to the same expansion scheme; a training unit 44, configured to train a natural language processing model using the initial sample and the extension samples together as training samples; and an assessment unit 45, configured to assess the trained natural language processing model and, if the assessment result meets a predetermined condition, to output the current natural language processing model for use in natural language processing.
In one implementation, the multiple manually labeled texts are texts preprocessed by word segmentation and stop-word removal.
According to one embodiment, the consistency condition preset in the expanding unit 43 may include, but is not limited to, at least one of the following:
the deviations of the length and width of the corresponding text matrices are both within a predetermined range;
the matrix similarity is greater than a predetermined matrix similarity threshold;
the texts are clustered into the same category.
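The first of these conditions, a bounded deviation in matrix length and width, could be used to assign expansion schemes as in the sketch below. The shape representation, the deviation limits, and the round-robin assignment of schemes to groups are all assumptions for illustration; the embodiment only requires that consistent texts share a scheme.

```python
from typing import Dict, List, Sequence, Tuple

Shape = Tuple[int, int]  # (rows, columns) of a text matrix; assumed layout


def shapes_consistent(a: Shape, b: Shape, max_row_dev: int = 2, max_col_dev: int = 0) -> bool:
    """True if both the row and the column deviation are within range."""
    return abs(a[0] - b[0]) <= max_row_dev and abs(a[1] - b[1]) <= max_col_dev


def assign_schemes(shapes: Dict[str, Shape], schemes: Sequence[str],
                   max_row_dev: int = 2, max_col_dev: int = 0) -> Dict[str, str]:
    """Give texts with consistent matrix sizes the same expansion scheme."""
    if not schemes:
        raise ValueError("at least one expansion scheme is required")
    representatives: List[Shape] = []
    assignment: Dict[str, str] = {}
    for text_id, shape in shapes.items():
        for idx, rep in enumerate(representatives):
            if shapes_consistent(shape, rep, max_row_dev, max_col_dev):
                break
        else:
            # No consistent group yet: this text starts a new group.
            representatives.append(shape)
            idx = len(representatives) - 1
        assignment[text_id] = schemes[idx % len(schemes)]
    return assignment
```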
In one embodiment, the expansion schemes in the expanding unit 43 may include word vector extension, the word vectors including at least one of the following: word-embedding-based word vectors, stroke-based word vectors, and affix-based word vectors.
In a further embodiment, assume that the multiple texts in the initial sample include a first text and that the first text includes a first word, where the first text may be any text of the initial sample and the first word may be any word in the first text. In this case, the expanding unit 43 is further configured to:
for the first word, detect whether the corpus contains a similar word whose word vector has a similarity to the word vector of the first word greater than a predetermined word similarity threshold;
if a similar word exists, replace the first word with the similar word, so as to extend the first text.
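The word replacement described above can be sketched with cosine similarity between word vectors. The use of cosine similarity, the in-memory vector table, and the threshold of 0.9 are assumptions for this example; the embodiment does not fix the similarity measure or how the corpus vectors are stored.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_by_word_vectors(tokens: List[str], vectors: Dict[str, Sequence[float]],
                           threshold: float = 0.9) -> List[List[str]]:
    """Return extended token lists, each with one word replaced by a similar word.

    `vectors` is a hypothetical word-to-vector table built from the corpus.
    """
    extended: List[List[str]] = []
    for i, word in enumerate(tokens):
        if word not in vectors:
            continue
        for candidate, vec in vectors.items():
            if candidate != word and cosine(vectors[word], vec) > threshold:
                extended.append(tokens[:i] + [candidate] + tokens[i + 1:])
    return extended
```

Each returned token list is one extended text; words without a sufficiently similar neighbour are left unchanged.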
In another embodiment, the expansion schemes in the expanding unit 43 may also include multilingual translation extension. Assuming that the initial sample includes a first text described in a first language (such as Chinese), the expanding unit 43 may be further configured to:
translate the first text, by a language conversion model, into a second text described in a second language (such as English);
translate the second text, by a language conversion model, into a third text described in the first language;
determine the text obtained by extension according to the third text.
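The translation extension amounts to a round trip through a second language. In the sketch below, `translate` is a hypothetical callable standing in for the language conversion model, the language codes are placeholders, and discarding a round-trip result identical to the input is an assumption about how the extended text is "determined according to the third text".

```python
from typing import Callable, Optional

# Hypothetical signature: translate(text, source_language, target_language)
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator,
                   first_lang: str = "zh", second_lang: str = "en") -> Optional[str]:
    """Extend a text by translating it to a second language and back.

    Returns the third text, or None if the round trip reproduces the input
    and therefore adds no new sample (an assumption of this sketch).
    """
    second_text = translate(text, first_lang, second_lang)
    third_text = translate(second_text, second_lang, first_lang)
    return third_text if third_text != text else None
```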
In a possible design, the expanding unit 43 is further configured to:
label the texts obtained by extension with a pre-trained labeling model, and form extension samples from the labeling results together with the corresponding texts obtained by extension.
In one embodiment, the predetermined condition in the assessment unit 45 may include, but is not limited to, at least one of the following:
the area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
the balanced F score of the trained natural language processing model is greater than a second threshold, wherein the balanced F score is a weighted average of precision and recall.
It is worth noting that the device 400 shown in Fig. 4 is a device embodiment corresponding to the method embodiment shown in Fig. 2, and the corresponding descriptions in the method embodiment shown in Fig. 2 also apply to the device 400; they are not repeated here.
With the above device, multiple expansion schemes can be selected to extend the initial sample, and texts whose text matrix sizes meet the consistency condition correspond to the same expansion scheme. Whether an expansion scheme is suitable is determined by assessing the trained model. In this way, the effectiveness of natural language processing can be improved.
According to an embodiment of another aspect, a computer-readable storage medium is also provided, on which a computer program is stored; when the computer program is executed in a computer, it causes the computer to perform the method described with reference to Fig. 2.
According to an embodiment of yet another aspect, a computing device is also provided, including a memory and a processor, wherein executable code is stored in the memory and, when the processor executes the executable code, the method described with reference to Fig. 2 is implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium.
The specific embodiments described above further explain the purpose, technical solutions, and beneficial effects of the present invention in detail. It should be understood that the above are merely specific embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent substitution, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

1. A computer-executed method for natural language processing, the method comprising:
obtaining multiple manually labeled texts as an initial sample;
determining the text matrix corresponding to each text in the initial sample;
extending each text in the initial sample by at least one expansion scheme in a pre-stored expansion scheme set, and generating extension samples from the texts obtained by extension, wherein, among the texts in the initial sample, texts whose text matrices meet a consistency condition correspond to the same expansion scheme;
training a natural language processing model using the initial sample and the extension samples together as training samples;
assessing the trained natural language processing model and, if the assessment result meets a predetermined condition, outputting the current natural language processing model for use in natural language processing.
2. The method according to claim 1, wherein the multiple manually labeled texts are texts preprocessed by word segmentation and stop-word removal.
3. The method according to claim 1, wherein the consistency condition includes at least one of the following:
the deviations of the length and width of the corresponding text matrices are both within a predetermined range;
the matrix similarity is greater than a predetermined matrix similarity threshold;
the texts are clustered into the same category.
4. The method according to claim 1, wherein the expansion scheme includes word vector extension, the word vectors including at least one of the following: word-embedding-based word vectors, stroke-based word vectors, and affix-based word vectors.
5. The method according to claim 4, wherein the multiple texts in the initial sample include a first text, and the first text includes a first word; and
extending each text in the initial sample by at least one expansion scheme in the expansion scheme set comprises:
for the first word, detecting whether the corpus contains a similar word whose word vector has a similarity to the word vector of the first word greater than a predetermined word similarity threshold;
if the similar word exists, replacing the first word with the similar word, so as to extend the first text.
6. The method according to claim 1, wherein the expansion scheme includes multilingual translation extension, the multiple texts in the initial sample include a first text, and the first text is described in a first language;
extending each text in the initial sample by at least one expansion scheme in the expansion scheme set comprises:
translating the first text, by a language conversion model, into a second text described in a second language;
translating the second text, by a language conversion model, into a third text described in the first language;
determining the text obtained by extension according to the third text.
7. The method according to claim 1, wherein generating extension samples from the texts obtained by extension comprises:
labeling the texts obtained by extension with a pre-trained labeling model, and forming extension samples from the labeling results together with the corresponding texts obtained by extension.
8. The method according to claim 1, wherein the predetermined condition includes at least one of the following:
the area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
the balanced F score of the trained natural language processing model is greater than a second threshold, wherein the balanced F score is a weighted average of precision and recall.
9. A device for natural language processing, the device comprising:
an acquiring unit, configured to obtain multiple manually labeled texts as an initial sample;
a determination unit, configured to determine the text matrix corresponding to each text in the initial sample;
an expanding unit, configured to extend each text in the initial sample by at least one expansion scheme in a pre-stored expansion scheme set, and to generate extension samples from the texts obtained by extension, wherein, among the texts in the initial sample, texts whose text matrices meet a consistency condition correspond to the same expansion scheme;
a training unit, configured to train a natural language processing model using the initial sample and the extension samples together as training samples;
an assessment unit, configured to assess the trained natural language processing model and, if the assessment result meets a predetermined condition, to output the current natural language processing model for use in natural language processing.
10. The device according to claim 9, wherein the multiple manually labeled texts are texts preprocessed by word segmentation and stop-word removal.
11. The device according to claim 9, wherein the consistency condition includes at least one of the following:
the deviations of the length and width of the corresponding text matrices are both within a predetermined range;
the matrix similarity is greater than a predetermined matrix similarity threshold;
the texts are clustered into the same category.
12. The device according to claim 9, wherein the expansion scheme includes word vector extension, the word vectors including at least one of the following: word-embedding-based word vectors, stroke-based word vectors, and affix-based word vectors.
13. The device according to claim 12, wherein the multiple texts in the initial sample include a first text, and the first text includes a first word; and
the expanding unit is further configured to:
for the first word, detect whether the corpus contains a similar word whose word vector has a similarity to the word vector of the first word greater than a predetermined word similarity threshold;
if the similar word exists, replace the first word with the similar word, so as to extend the first text.
14. The device according to claim 9, wherein the expansion scheme includes multilingual translation extension, the multiple texts in the initial sample include a first text, and the first text is described in a first language;
the expanding unit is further configured to:
translate the first text, by a language conversion model, into a second text described in a second language;
translate the second text, by a language conversion model, into a third text described in the first language;
determine the text obtained by extension according to the third text.
15. The device according to claim 9, wherein the expanding unit is further configured to:
label the texts obtained by extension with a pre-trained labeling model, and form extension samples from the labeling results together with the corresponding texts obtained by extension.
16. The device according to claim 9, wherein the predetermined condition includes at least one of the following:
the area under the curve (AUC) of the trained natural language processing model is greater than a first threshold;
the balanced F score of the trained natural language processing model is greater than a second threshold, wherein the balanced F score is a weighted average of precision and recall.
17. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed in a computer, it causes the computer to perform the method of any one of claims 1-8.
18. A computing device, comprising a memory and a processor, characterized in that executable code is stored in the memory and, when the processor executes the executable code, the method of any one of claims 1-8 is implemented.
CN201811519721.8A 2018-12-12 2018-12-12 The method and device of natural language processing Pending CN110008335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811519721.8A CN110008335A (en) 2018-12-12 2018-12-12 The method and device of natural language processing

Publications (1)

Publication Number Publication Date
CN110008335A true CN110008335A (en) 2019-07-12

Family

ID=67165181

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811519721.8A Pending CN110008335A (en) 2018-12-12 2018-12-12 The method and device of natural language processing

Country Status (1)

Country Link
CN (1) CN110008335A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110705271A (en) * 2019-09-27 2020-01-17 中国建设银行股份有限公司 System and method for providing natural language processing service
CN111104482A (en) * 2019-12-18 2020-05-05 北京百度网讯科技有限公司 Data processing method and device
CN111242790A (en) * 2020-01-02 2020-06-05 平安科技(深圳)有限公司 Risk identification method, electronic device and storage medium
CN111324720A (en) * 2020-03-04 2020-06-23 万贝科技集团有限公司 Natural language processing method based on annular matching
CN111831788A (en) * 2020-06-16 2020-10-27 国网江苏省电力有限公司信息通信分公司 Electric power corpus marking model construction method and system
CN112000799A (en) * 2020-07-02 2020-11-27 广东华兴银行股份有限公司 Chinese public opinion monitoring method based on pinyin feature enhancement
CN112000800A (en) * 2020-07-02 2020-11-27 广东华兴银行股份有限公司 Chinese public opinion monitoring method based on Chinese character word-forming method
CN112446232A (en) * 2019-08-27 2021-03-05 贵州数安智能科技有限公司 Continuous self-learning image identification method and system
WO2021047003A1 (en) * 2019-09-09 2021-03-18 深圳前海微众银行股份有限公司 Text positioning method, apparatus, device, and storage medium
CN113779185A (en) * 2020-06-10 2021-12-10 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN114118068A (en) * 2022-01-26 2022-03-01 北京淇瑀信息科技有限公司 Method and device for amplifying training text data and electronic equipment
CN114972910A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Image-text recognition model training method and device, electronic equipment and storage medium
CN117131159A (en) * 2023-08-30 2023-11-28 上海通办信息服务有限公司 Method, device, equipment and storage medium for extracting sensitive information

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108197121A (en) * 2017-12-29 2018-06-22 北京中关村科金技术有限公司 Acquisition methods, system, device and the readable storage medium storing program for executing of machine learning language material
CN108897769A (en) * 2018-05-29 2018-11-27 武汉大学 Network implementations text classification data set extension method is fought based on production
CN108920482A (en) * 2018-04-27 2018-11-30 浙江工业大学 Microblogging short text classification method based on Lexical Chains feature extension and LDA model
CN108960276A (en) * 2018-05-08 2018-12-07 南京理工大学 The sample for promoting spectrum picture supervised classification performance expands and consistency discrimination method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112446232A (en) * 2019-08-27 2021-03-05 贵州数安智能科技有限公司 Continuous self-learning image identification method and system
WO2021047003A1 (en) * 2019-09-09 2021-03-18 深圳前海微众银行股份有限公司 Text positioning method, apparatus, device, and storage medium
CN110705271A (en) * 2019-09-27 2020-01-17 中国建设银行股份有限公司 System and method for providing natural language processing service
CN110705271B (en) * 2019-09-27 2024-01-26 中国建设银行股份有限公司 System and method for providing natural language processing service
CN111104482A (en) * 2019-12-18 2020-05-05 北京百度网讯科技有限公司 Data processing method and device
CN111242790A (en) * 2020-01-02 2020-06-05 平安科技(深圳)有限公司 Risk identification method, electronic device and storage medium
CN111242790B (en) * 2020-01-02 2020-11-17 平安科技(深圳)有限公司 Risk identification method, electronic device and storage medium
CN111324720A (en) * 2020-03-04 2020-06-23 万贝科技集团有限公司 Natural language processing method based on annular matching
CN113779185A (en) * 2020-06-10 2021-12-10 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN113779185B (en) * 2020-06-10 2023-12-29 武汉Tcl集团工业研究院有限公司 Natural language model generation method and computer equipment
CN111831788A (en) * 2020-06-16 2020-10-27 国网江苏省电力有限公司信息通信分公司 Electric power corpus marking model construction method and system
CN112000800A (en) * 2020-07-02 2020-11-27 广东华兴银行股份有限公司 Chinese public opinion monitoring method based on Chinese character word-forming method
CN112000799A (en) * 2020-07-02 2020-11-27 广东华兴银行股份有限公司 Chinese public opinion monitoring method based on pinyin feature enhancement
CN114118068A (en) * 2022-01-26 2022-03-01 北京淇瑀信息科技有限公司 Method and device for amplifying training text data and electronic equipment
CN114118068B (en) * 2022-01-26 2022-04-29 北京淇瑀信息科技有限公司 Method and device for amplifying training text data and electronic equipment
CN114972910A (en) * 2022-05-20 2022-08-30 北京百度网讯科技有限公司 Image-text recognition model training method and device, electronic equipment and storage medium
CN114972910B (en) * 2022-05-20 2023-05-23 北京百度网讯科技有限公司 Training method and device for image-text recognition model, electronic equipment and storage medium
CN117131159A (en) * 2023-08-30 2023-11-28 上海通办信息服务有限公司 Method, device, equipment and storage medium for extracting sensitive information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201013

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.