CN107894976A - A mixed-corpus word segmentation method based on Bi-LSTM - Google Patents
- Publication number: CN107894976A (application CN201710946891.3A)
- Authority: CN (China)
- Prior art keywords: data, lstm, character, sentence, label
- Legal status: Withdrawn (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates (G: Physics; G06F: Electric digital data processing; G06F40/00: Handling natural language data; G06F40/20: Natural language analysis; G06F40/279: Recognition of textual entities)
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking (same parent classes as above)
- G06N3/08 — Learning methods (G06N: Computing arrangements based on specific computational models; G06N3/00: Computing arrangements based on biological models; G06N3/02: Neural networks)
Abstract
The invention discloses a mixed-corpus word segmentation method based on Bi-LSTM. The method is as follows: convert the training mixed-corpus data into character-level corpus data; count the characters of the corpus data to obtain a character set and number each character, yielding a character-ID set; count the character labels to obtain a label set and number the labels, yielding a label-ID set; split the corpus into sentences and group the sentences by length, yielding a data set; randomly, without replacement, choose one sentence group from the data set and extract several sentences from it, where the characters of each sentence form a datum w and the corresponding label set is y; convert w into the corresponding IDs and feed it, together with the labels y, into the model Bi-LSTM to train the parameters of the deep-learning model; convert the data to be predicted into data matching the deep-learning model, feed it into the trained model, and obtain the segmentation result.
Description
Technical field
The invention belongs to the field of computer software technology and relates to a mixed-corpus word segmentation method based on Bi-LSTM.
Background
Bi-LSTM stands for Bi-directional Long Short-Term Memory, i.e., a bidirectional long short-term memory neural network.
A mixed corpus, in this patent, refers to training or prediction data that contain corpus data in at least two languages.
Word segmentation refers to labeling a continuous input character string into a continuous label sequence according to semantic information. In this patent, it refers to cutting word-sequence data of Asian scripts (Simplified Chinese, Traditional Chinese, Korean, and Japanese) into individual words, using spaces as the separators between words.
The method of segmenting a mixed corpus involves expertise in two areas. One is unifying the data formats of multiple corpora at the character level. The other is sequence labeling in natural language processing: taking a sequence as input and training a model to produce the correct output for each element of the sequence.
For multilingual word segmentation, the traditional pipeline is: multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation.
Language detection first requires choosing the granularity of the check. If the check is done at the document level, a document containing two or more languages will be detected inaccurately, so that only one language is handled and the other is ignored. A finer-grained division is then needed: the text must be split into paragraphs or sentences before language detection. The mixed-corpus segmentation of the present invention simplifies the traditional multilingual pipeline to: multilingual input text --> word segmentation, thereby avoiding paragraph splitting, sentence splitting, and language detection.
Meanwhile, the application scenarios of the mixed-language segmentation method of the present invention also include:
1. Full-text indexing in multilingual search engines. An important function of a search engine is full-text indexing of documents: the document content is segmented into words, and the documents together with their segmentation results form an inverted index. A user's query is likewise segmented first, and the segmentation result is compared against the index database to find the documents most similar to the current input.
2. Multilingual automatic summarization. Automatic summarization condenses a longer document into a shorter passage of text. During summarization the keywords of the document must be computed, so the document must first be segmented before the keywords can be calculated.
3. Multilingual automatic proofreading. Automatic proofreading checks a passage for grammatical errors at the granularity of words, so the continuous text entered by the user must first be segmented.
The steps of the traditional segmentation method for mixed text containing multiple languages are: multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation.
For each language, segmentation can be dictionary-based or statistics-based. Dictionary-based segmentation collects all possible words into a dictionary and then cuts the text against the dictionary vocabulary by forward maximum matching or forward minimum matching. Statistics-based segmentation counts the frequency with which adjacent characters co-occur; if the frequency exceeds a given threshold, the characters are taken as a habitual collocation and treated as one segmentation unit. The shortcomings of these methods are:
Shortcoming 1: detection granularity is hard to choose for multilingual text, and segmentation precision is lost when some language goes undetected. A document containing multiple languages must first be split into paragraphs and each paragraph checked for its language type; if a paragraph still contains multiple languages, it must be split again into sentences, and a sentence containing multiple languages cannot be split any further. Because the segmentation model depends heavily on the corpus of one language, segmentation information is lost whenever a language is not detected. The present invention breaks this heavy dependence of the segmentation model on a single-language corpus: the corpora of multiple languages are mixed, and one unified model is trained on the mixed corpus.
Shortcoming 2: dictionary-based methods rely excessively on the dictionary and cannot use semantic information to recognize out-of-vocabulary words that never appeared in the dictionary.
Shortcoming 3: current statistics-based approaches are mainly HMM (hidden Markov) models and CRF (conditional random field) models. To keep computation tractable, the current word is considered to be associated only with the previous word, and the remaining words are treated as conditionally independent of one another. This assumption does not match reality, so the segmentation precision of these models still has room for improvement.
In the present invention, the Bi-LSTM method is introduced. On the one hand, it is a statistics-based method and therefore avoids the heavy dependence of dictionary-based methods on a dictionary. On the other hand, Bi-LSTM is a long short-term memory model, which breaks the conventional statistical assumption that the current word is related only to the previous word: the current word is related to all surrounding words in the sentence, so more semantic information is absorbed.
Summary of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a mixed-corpus word segmentation method based on Bi-LSTM. The core of the invention comprises two parts:
Part 1: unifying the format of the multilingual mixed corpus. To avoid the language-detection step that multilingual segmentation would otherwise require first, the present invention proposes a character-level segmentation method and puts the mixed corpus of multiple languages together into one deep-learning model for training.
Part 2: improving the precision of multilingual segmentation. To improve precision and overcome the CRF assumption that the current word is correlated only with the previous word, we introduce the long short-term memory neural network (LSTM) of deep learning, so that the current word is correlated with all preceding words in the sentence.
The technical scheme of the present invention is as follows:
A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) Convert the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) Count the characters of Corpus_by_Char to obtain a character set CharSet, and number each character in CharSet to obtain the character-ID set CharID corresponding to CharSet; count the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and number the labels in LabelSet to obtain the corresponding label-ID set LabelID;
3) Split Corpus_by_Char into sentences by sentence length; then group the resulting sentences by length to obtain a data set GroupData containing n sentence groups;
4) Randomly, without replacement, choose one sentence group from GroupData and extract BatchSize sentences from it; the characters of each sentence form a datum w, and the label set corresponding to the characters of the sentence is y; convert w into the corresponding IDs via the character-ID set CharID to obtain the data BatchData; convert the labels in y into the corresponding IDs via the label-ID set LabelID to obtain the data y_ID;
5) Feed the multiple data BatchData generated in step 4) together with the corresponding label data y_ID into the deep-learning model Bi-LSTM and train its parameters; when the loss Cost(y′, y_ID) produced by the model satisfies the stopping condition or the maximum number of iterations N is reached, terminate training and obtain the trained deep-learning model Bi-LSTM; otherwise regenerate BatchData by the method of step 4) and continue training the model;
6) Convert the mixed-corpus data PreData to be predicted into data PreMData matching the deep-learning model Bi-LSTM, feed it into the trained model, and obtain the segmentation result OrgResult.
Further, the length of BatchData is a fixed length maxLen. When the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain y_ID; here maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep-learning model Bi-LSTM.
Further, the method of producing the loss Cost(y′, y_ID) is:
31) vectorize the data BatchData in the Embedding layer of the deep-learning model Bi-LSTM, converting each character in BatchData into a vector;
32) pass the vectors corresponding to each BatchData into the Bi-LSTM layer of the model, where the vector of each character in BatchData is passed to one forward and one reverse LSTM unit of the Bi-LSTM layer; the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit;
33) pass the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units into the Concatenate layer, where they are spliced together into h_i = [h_i^f; h_i^b], and then into the DropOut layer;
34) process the output of the DropOut layer through the SoftMax layer, and compute the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
Further, the loss Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
Further, the stopping condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
Further, in step 3), sentences with |l_i − l_j| < δ are placed in one group, where l_i denotes the length of the i-th sentence, l_j denotes the length of the j-th sentence, and δ denotes the sentence-length interval.
Further, in step 1), the BMESO tagging scheme is used to cut each labeled word in the training mixed-corpus data Original_Corpus at the character level: if the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a word consisting of only a single character is tagged Label_S.
Further, the parameters of the deep-learning model Bi-LSTM-CNN are trained using the Adam gradient-descent algorithm.
The flow of the method of the invention is shown in Fig. 1 and consists of two stages: a training stage and a prediction stage.
(1) Training stage (left dashed box of the flow chart):
Step 1: convert the labeled training mixed-corpus data into character-level mixed-corpus data.
Step 2: train the deep-learning model using the Adam gradient-descent algorithm.
(2) Prediction stage (right dashed box of the flow chart):
Step 1: convert the unlabeled test mixed-corpus data into character-level mixed-corpus data.
Step 2: predict with the deep-learning model trained in the training stage.
The present invention mainly has the following advantages: the Bi-LSTM-based mixed-corpus segmentation method works at the character level rather than the word level, so it avoids the out-of-vocabulary problem and improves precision considerably. Model training is carried out directly on the mixed corpus, without detecting and separating the individual languages of the mixture, finally achieving the goal of recognizing mixed corpora.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 is the architecture diagram of the deep-learning model.
Embodiments
To make the above features and advantages of the present invention clearer, specific embodiments are given below and described in detail with reference to the accompanying drawings.
The flow of the method of the present invention is shown in Fig. 1 and includes:
(1) Training stage:
Step 1: convert the original training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char. Specifically, using the BMESO (Begin, Middle, End, Single, Other) tagging scheme, each labeled word in the original training mixed corpus is cut at the character level. If the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a single-character word is tagged Label_S. Because an LSTM is essentially a neural network model, the input length must be kept fixed when sentences enter the Bi-LSTM model, so a sentence that is too short is padded with O (Other). For mixed-corpus segmentation, the corpora of multiple languages are tagged in the above format and mixed into one corpus data set.
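The character-level cutting of Step 1 can be sketched in a few lines of Python (a minimal illustration; the function names and the placeholder label `"W"` used below are ours, not the patent's):

```python
def to_char_level(words):
    """Cut a list of (word, label) pairs into (character, tag) pairs
    using the BMES part of the BMESO scheme described above."""
    chars = []
    for word, label in words:
        if len(word) == 1:                        # single-character word
            chars.append((word, label + "_S"))
        else:
            chars.append((word[0], label + "_B"))       # word beginning
            for c in word[1:-1]:
                chars.append((c, label + "_M"))         # word middle
            chars.append((word[-1], label + "_E"))      # word end
    return chars

def pad_with_other(tagged, max_len):
    """Pad a sentence shorter than the fixed input length with O (Other)."""
    return tagged + [("", "O")] * (max_len - len(tagged))
```

For example, `to_char_level([("中国", "W")])` yields `[("中", "W_B"), ("国", "W_E")]`.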
Step 2: count the characters in Corpus_by_Char to obtain a character set CharSet. For example, suppose there are two words, 中国 and 中华; the merged character set is {中, 国, 华}. Number each character in CharSet with increasing natural numbers to obtain the character-ID set CharID corresponding to CharSet. Count the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and similarly produce the corresponding ID set LabelID. LabelSet is generally {B, M, E, S}; the corresponding LabelID represents each label in LabelSet by a number, which is convenient for the program to identify.
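Step 2 amounts to building two symbol-to-ID dictionaries. A minimal sketch (names are illustrative; reserving ID 0 for padding is our assumption, not stated in the patent):

```python
def build_id_maps(char_tagged_corpus):
    """char_tagged_corpus: iterable of (character, label) pairs.
    Returns (CharID, LabelID), each mapping a symbol to a natural number."""
    char_set = sorted({c for c, _ in char_tagged_corpus})
    label_set = sorted({t for _, t in char_tagged_corpus})
    # number with increasing natural numbers; ID 0 is left for padding zeros
    char_id = {c: i + 1 for i, c in enumerate(char_set)}
    label_id = {t: i + 1 for i, t in enumerate(label_set)}
    return char_id, label_id
```

With the example above, the characters of 中国 and 中华 merge into three entries of CharID.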
Step 3: split Corpus_by_Char by sentence length. Let l_i denote the length of the i-th sentence; sentences with |l_i − l_j| < δ are placed in one group, where δ denotes the sentence-length interval. The grouped data are denoted GroupData, with n groups in total.
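One simple way to realize the |l_i − l_j| < δ grouping is to bucket sentences by length interval (an approximation we chose; the patent does not prescribe a particular grouping algorithm):

```python
def group_by_length(sentences, delta):
    """Bucket sentences into groups whose lengths fall in the same
    interval of width delta, approximating the |l_i - l_j| < delta rule."""
    groups = {}
    for s in sentences:
        groups.setdefault(len(s) // delta, []).append(s)
    return list(groups.values())   # GroupData: n groups of sentences
```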
Step 4: randomly, without replacement, extract BatchSize sentences from one group of GroupData. The characters of each sentence form a datum w, and the label set corresponding to the characters of the sentence is y. Convert the extracted w into the corresponding IDs via CharID to obtain fixed-length data BatchData (corresponding to w1, w2, …, wn in Fig. 2), and convert the corresponding labels y into IDs via LabelID to obtain fixed-length data y_ID. Because the sentence lengths within a group are close, precision improves by about 2 percentage points compared with extracting sentences at random.
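The sampling of Step 4 can be sketched as follows (the function name and the handling of groups smaller than BatchSize are our assumptions):

```python
import random

def make_batch(group_data, char_id, label_id, batch_size, rng=random):
    """Pick one sentence group at random, draw batch_size sentences from it
    without replacement, and convert characters/labels to IDs.
    Each sentence is a list of (character, label) pairs."""
    group = rng.choice(group_data)
    picked = rng.sample(group, min(batch_size, len(group)))
    batch_data = [[char_id[c] for c, _ in sent] for sent in picked]   # w -> IDs
    y_id = [[label_id[t] for _, t in sent] for sent in picked]        # y -> IDs
    return batch_data, y_id
```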
Step 5: feed the multiple data BatchData of step 4 together with the corresponding label data y_ID into the deep-learning model, producing the loss Cost(y′, y_ID). The specific formula is:
Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′)   (Formula 1)
where y′ denotes the output of BatchData after the classification layer (SoftMax layer) of the model, corresponding to y1, y2, …, yn in Fig. 2.
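Formula 1 is the standard binary cross-entropy. A simplified scalar sketch (the patent applies it per character position over the SoftMax outputs; here the predicted probabilities and 0/1 targets are plain lists, and averaging over the batch is our choice):

```python
import math

def cost(y_pred, y_true):
    """Binary cross-entropy of Formula 1, averaged over a batch of
    per-character probabilities y_pred against 0/1 targets y_true."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        total += -t * math.log(p) - (1 - t) * math.log(1 - p)
    return total / len(y_pred)
```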
Step 6: train the parameters of the deep-learning model's loss function using the mini-batch gradient-descent algorithm (mini_batch).
Step 7: if the loss Cost(y′, y_ID) produced by the deep-learning model no longer decreases, or the maximum number of iterations N is reached, terminate the training of the model; otherwise jump back to step 4.
Here, let Cost_i(y′, y_ID) denote the loss i iterations earlier and Cost(y′, y_ID) the loss of the current iteration. "No longer decreases" means that the difference between the current loss and the average of the previous M losses is less than the threshold θ:
|Cost(y′, y_ID) − (1/M)·Σ_{i=1}^{M} Cost_i(y′, y_ID)| < θ
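The stopping rule of Step 7 can be sketched as a small closure (names and the exact bookkeeping are ours):

```python
from collections import deque

def make_stopper(m, theta, max_iters):
    """Stopping rule of Step 7: stop when |current cost - mean of the
    previous m costs| < theta, or when max_iters is reached."""
    history = deque(maxlen=m)
    def should_stop(cost_value, iteration):
        if iteration >= max_iters:
            return True
        if len(history) == m:
            mean = sum(history) / m
            if abs(cost_value - mean) < theta:
                return True
        history.append(cost_value)
        return False
    return should_stop
```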
(2) Prediction stage:
Step 1: convert the mixed-corpus data PreData to be predicted into data PreMData in the format matching the model. Specifically, convert the mixed-corpus data to be predicted into character-level numerical data.
Step 2: feed PreMData into the deep-learning model trained in the training stage, and obtain the segmentation prediction result OrgResult.
In training-stage step 4, the extracted data are converted via CharID into several fixed-length data BatchData, and the corresponding labels are converted via LabelID into several fixed-length data y_ID. Specifically:
Step 1: convert the extracted data w into numbers, i.e., convert each character in w into the corresponding number through the correspondence between CharSet and CharID.
Step 2: convert the label set y corresponding to the extracted data w into numbers, i.e., convert each label character in y into the corresponding number through the correspondence between LabelSet and LabelID, obtaining the data y_ID.
Step 3: suppose the fixed length is maxLen. When the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData; maxLen equals the number of forward LSTM units of the Bi-LSTM layer. Generally, fewer than 5% of the sentences are very long, and paying too much attention to those sentences would reduce precision considerably (if l ≥ maxLen occurs, a simple treatment is to discard the sentence directly, or to split the long sentence into short sentences for processing). Likewise, maxLen − l zeros are appended to the data y_ID corresponding to w, obtaining y_ID.
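The padding rule above can be sketched as follows (this follows the patent's literal rule that sentences with l ≥ maxLen are simply discarded; the function name is ours):

```python
def pad_or_skip(id_seq, label_seq, max_len):
    """Pad a character-ID sequence and its label-ID sequence with zeros
    up to max_len; return None for over-long sentences (the 'simply
    discard' option in the text, which affects < 5% of sentences)."""
    l = len(id_seq)
    if l >= max_len:
        return None                    # drop the long sentence
    pad = [0] * (max_len - l)
    return id_seq + pad, label_seq + pad
```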
The architecture of the deep-learning model of the present invention is shown in Fig. 2. In training-stage step 5, the data BatchData and their label data y_ID are fed into the deep-learning model, producing the loss Cost(y′, y_ID). Specifically:
Step 1: vectorize the incoming data BatchData in the Embedding layer, i.e., convert each character in each BatchData into the vector corresponding to its character ID through a character-to-vector lookup table Char2Vec. Each character has a corresponding character-ID entry in the table Char2Vec.
Step 2: pass the vectors obtained in step 1 into the Bi-LSTM layer. In detail, the vector w1 of the first character of each datum BatchData is passed to the first forward LSTM unit of the Bi-LSTM layer, the vector w2 of the second character to the second forward LSTM unit, and so on. Meanwhile, the input of the i-th forward LSTM unit includes, besides the vector of the i-th character of each datum, the output of the (i−1)-th forward LSTM unit: the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit. Likewise, the vector of the first character of each datum BatchData is passed to the first reverse LSTM unit of the Bi-LSTM layer, the vector of the second character to the second reverse LSTM unit, and so on; the input of the i-th reverse LSTM unit includes, besides the vector of the i-th character, the output of the (i−1)-th reverse LSTM unit: the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit. Note that each LSTM unit receives not just one vector at a time but BatchSize of them.
Step 3: pass the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units to the Concatenate layer, i.e., splice the output results of the forward and reverse LSTM units together into h_i = [h_i^f; h_i^b].
Step 4: pass the output of the Concatenate layer to the DropOut layer, i.e., randomly retain a fraction η (0 ≤ η ≤ 1) of the components of h_i and hide the rest, so that the hidden components are not transmitted further backward.
Step 5: pass the output of DropOut to the SoftMax layer, and produce the final loss Cost(y′, y_ID) from the SoftMax output y′ and the corresponding incoming label data y_ID; see Formula 1 for the specific calculation.
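Steps 1–5 can be illustrated with a toy, pure-Python forward pass (a sketch of the wiring only, not the patent's implementation: weights are random, the Embedding lookup and DropOut layer are omitted, and the dimensions are tiny):

```python
import math
import random

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM unit: gates computed from the character vector x and the
    previous unit's output h_prev (W is a dict of gate weight rows/biases)."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    z = x + h_prev                                   # concatenated input
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    i = [sig(dot(w, z) + b) for w, b in zip(W["i"], W["bi"])]   # input gate
    f = [sig(dot(w, z) + b) for w, b in zip(W["f"], W["bf"])]   # forget gate
    o = [sig(dot(w, z) + b) for w, b in zip(W["o"], W["bo"])]   # output gate
    g = [math.tanh(dot(w, z) + b) for w, b in zip(W["g"], W["bg"])]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def init_weights(in_dim, hidden, rng):
    mk = lambda: [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hidden)]
                  for _ in range(hidden)]
    bias = lambda: [0.0] * hidden
    return {"i": mk(), "f": mk(), "o": mk(), "g": mk(),
            "bi": bias(), "bf": bias(), "bo": bias(), "bg": bias()}

def bi_lstm(vectors, W_fwd, W_bwd, hidden):
    """Steps 2-3: a forward chain and a reverse chain of LSTM units over
    the character vectors, then the Concatenate layer h_i = [h_i^f; h_i^b]."""
    def run(vs, W):
        h, c = [0.0] * hidden, [0.0] * hidden
        outs = []
        for v in vs:
            h, c = lstm_cell(v, h, c, W)
            outs.append(h)
        return outs
    fwd = run(vectors, W_fwd)
    bwd = run(vectors[::-1], W_bwd)[::-1]            # reverse chain, re-aligned
    return [hf + hb for hf, hb in zip(fwd, bwd)]     # concatenation

def softmax(v):
    """Step 5's classification layer over one concatenated vector."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]
```

For instance, running `bi_lstm` over two 4-dimensional character vectors with hidden size 3 yields one 6-dimensional h_i per character.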
The deep-learning model in prediction-stage step 2 is exactly the model trained in the training stage, except that during prediction the parameter of the DropOut layer is set to η = 1, meaning that no data are hidden and everything is passed to the next layer.
The embodiments above merely illustrate rather than limit the technical scheme of the present invention. Those of ordinary skill in the art may modify the technical scheme or make equivalent substitutions without departing from the spirit and scope of the present invention, whose protection scope shall be defined by the claims.
Claims (8)
1. A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) converting the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) counting the characters of Corpus_by_Char to obtain a character set CharSet, and numbering each character in CharSet to obtain the character-ID set CharID corresponding to CharSet; counting the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and numbering the labels in LabelSet to obtain the corresponding label-ID set LabelID;
3) splitting Corpus_by_Char into sentences by sentence length, then grouping the resulting sentences by length to obtain a data set GroupData containing n sentence groups;
4) randomly, without replacement, choosing one sentence group from GroupData and extracting BatchSize sentences from it, the characters of each sentence forming a datum w and the label set corresponding to the characters of the sentence being y; converting w into the corresponding IDs via the character-ID set CharID to obtain the data BatchData; converting the labels in y into the corresponding IDs via the label-ID set LabelID to obtain the data y_ID;
5) feeding the multiple data BatchData generated in step 4) together with the corresponding label data y_ID into the deep-learning model Bi-LSTM and training the parameters of the model; when the loss Cost(y′, y_ID) produced by the model satisfies the stopping condition or the maximum number of iterations N is reached, terminating the training and obtaining the trained deep-learning model Bi-LSTM; otherwise regenerating BatchData by the method of step 4) and continuing to train the model Bi-LSTM;
6) converting the mixed-corpus data PreData to be predicted into data PreMData matching the deep-learning model Bi-LSTM, feeding it into the trained model, and obtaining the segmentation result OrgResult.
2. The method of claim 1, wherein the length of BatchData is a fixed length maxLen; when the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain y_ID; wherein maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep-learning model Bi-LSTM.
3. The method of claim 2, wherein the method of producing the loss Cost(y′, y_ID) is:
31) vectorizing the data BatchData in the Embedding layer of the deep-learning model Bi-LSTM, converting each character in BatchData into a vector;
32) passing the vectors corresponding to each BatchData into the Bi-LSTM layer of the model, wherein the vector of each character in BatchData is passed to one forward and one reverse LSTM unit of the Bi-LSTM layer; the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit;
33) passing the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units into the Concatenate layer, splicing them together into h_i = [h_i^f; h_i^b], and passing the result into the DropOut layer;
34) processing the output of the DropOut layer through the SoftMax layer, and computing the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
4. The method of claim 3, wherein the loss Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
5. The method of claim 1, wherein the stopping condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
6. The method of claim 1, wherein in step 3) sentences with |l_i − l_j| < δ are placed in one group, where l_i denotes the length of the i-th sentence, l_j denotes the length of the j-th sentence, and δ denotes the sentence-length interval.
7. The method of claim 1, wherein in step 1) the BMESO tagging scheme is used to cut each labeled word in the training mixed-corpus data Original_Corpus at the character level: if the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a word consisting of only a single character is tagged Label_S.
8. The method of claim 1, wherein the parameters of the deep-learning model Bi-LSTM-CNN are trained using the Adam gradient-descent algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710946891.3A | 2017-10-12 | 2017-10-12 | A mixed-corpus word segmentation method based on Bi-LSTM (published as CN107894976A)
Publications (1)
Publication Number | Publication Date
---|---
CN107894976A (en) | 2018-04-10
Family
- Family ID: 61802544
- Family Applications (1): CN201710946891.3A, filed 2017-10-12, status Withdrawn
- Country Status (1): CN — CN107894976A (en)
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243055A (en) * | 2015-09-28 | 2016-01-13 | 北京橙鑫数据科技有限公司 | Multi-language based word segmentation method and apparatus |
US20160086280A1 (en) * | 2013-06-02 | 2016-03-24 | Data Scientist Corp. | Evaluation method, evaluation device, and program |
CN105740226A (en) * | 2016-01-15 | 2016-07-06 | 南京大学 | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network |
CN105912533A (en) * | 2016-04-12 | 2016-08-31 | 苏州大学 | Method and device for long statement segmentation aiming at neural machine translation |
CN106682411A (en) * | 2016-12-22 | 2017-05-17 | 浙江大学 | Method for converting physical examination diagnostic data into disease label |
CN107168957A (en) * | 2017-06-12 | 2017-09-15 | 云南大学 | A kind of Chinese word cutting method |
- 2017-10-12 CN CN201710946891.3A patent/CN107894976A/en not_active Withdrawn
Non-Patent Citations (2)
Title |
---|
GRZEGORZ CHRUPA LA: "Text segmentation with character-level text embeddings", 《WORKSHOP ON DEEP LEARNING FOR AUDIO, SPEECH AND LANGUAGE PROCESSING,ICML2013》 * |
ONUR KURU等: "CharNER: Character-Level Named Entity Recognition", 《THE 26TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS: TECHNICAL PAPERS》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109388806B (en) * | 2018-10-26 | 2023-06-27 | 北京布本智能科技有限公司 | Chinese word segmentation method based on deep learning and forgetting algorithm |
CN110245332A (en) * | 2019-04-22 | 2019-09-17 | 平安科技(深圳)有限公司 | Chinese character code method and apparatus based on two-way length memory network model in short-term |
CN110245332B (en) * | 2019-04-22 | 2024-03-15 | 平安科技(深圳)有限公司 | Chinese coding method and device based on bidirectional long-short-term memory network model |
CN111126037A (en) * | 2019-12-18 | 2020-05-08 | 昆明理工大学 | Thai sentence segmentation method based on twin cyclic neural network |
US11966699B2 (en) | 2021-06-17 | 2024-04-23 | International Business Machines Corporation | Intent classification using non-correlated features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444721B (en) | Chinese text key information extraction method based on pre-training language model | |
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
CN109213861B (en) | Traveling evaluation emotion classification method combining At _ GRU neural network and emotion dictionary | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN112101028B (en) | Multi-feature bidirectional gating field expert entity extraction method and system | |
CN106570179B (en) | A kind of kernel entity recognition methods and device towards evaluation property text | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN104794169B (en) | A kind of subject terminology extraction method and system based on sequence labelling model | |
CN106096664B (en) | A kind of sentiment analysis method based on social network data | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN106844741A (en) | A kind of answer method towards specific area | |
CN110598203A (en) | Military imagination document entity information extraction method and device combined with dictionary | |
CN107045496A (en) | The error correction method and error correction device of text after speech recognition | |
CN105975454A (en) | Chinese word segmentation method and device of webpage text | |
CN107122349A (en) | A kind of feature word of text extracting method based on word2vec LDA models | |
CN107797987B (en) | Bi-LSTM-CNN-based mixed corpus named entity identification method | |
CN109284400A (en) | A kind of name entity recognition method based on Lattice LSTM and language model | |
CN107832289A (en) | A kind of name entity recognition method based on LSTM CNN | |
CN107145514B (en) | Chinese sentence pattern classification method based on decision tree and SVM mixed model | |
CN107894976A (en) | A kind of mixing language material segmenting method based on Bi LSTM | |
CN107967251A (en) | A kind of name entity recognition method based on Bi-LSTM-CNN | |
CN111444704B (en) | Network safety keyword extraction method based on deep neural network | |
CN107894975A (en) | A kind of segmenting method based on Bi LSTM | |
CN107797986A (en) | A kind of mixing language material segmenting method based on LSTM CNN | |
CN112364623A (en) | Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 2018-04-10