CN107894976A - A mixed-corpus word segmentation method based on Bi-LSTM - Google Patents

A mixed-corpus word segmentation method based on Bi-LSTM

Info

Publication number
CN107894976A
Authority
CN
China
Prior art keywords
data
lstm
character
sentence
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201710946891.3A
Other languages
Chinese (zh)
Inventor
岳永鹏
唐华阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Know Future Information Technology Co ltd
Original Assignee
Beijing Know Future Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Know Future Information Technology Co ltd filed Critical Beijing Know Future Information Technology Co ltd
Priority to CN201710946891.3A priority Critical patent/CN107894976A/en
Publication of CN107894976A publication Critical patent/CN107894976A/en
Withdrawn legal-status Critical Current

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a mixed-corpus word segmentation method based on Bi-LSTM. The method is as follows: convert the training mixed-corpus data into character-level corpus data; count the characters of the corpus data to obtain a character set, and number each character to obtain a character-ID set; count the character labels to obtain a label set, and number the labels to obtain a label-ID set; split the corpus into sentences according to sentence length and group the resulting sentences by length to obtain a data set; randomly choose a sentence group from the data set without replacement and extract several sentences from it, the characters of each sentence forming a data item w with corresponding label set y; convert w into the corresponding IDs and feed it together with the labels y into the model Bi-LSTM to train the parameters of the deep learning model; convert the data to be predicted into data matching the deep learning model and feed it into the trained deep learning model to obtain the segmentation result.

Description

A mixed-corpus word segmentation method based on Bi-LSTM
Technical field
The invention belongs to the field of computer software and relates to a mixed-corpus word segmentation method based on Bi-LSTM.
Background technology
Bi-LSTM is short for Bi-directional Long Short-Term Memory, i.e., a bidirectional long short-term memory neural network.
Mixed corpus: in this patent, the term refers to training or prediction data that contains corpus data in at least two languages.
Word segmentation refers to labeling an input character string with a continuous label sequence according to semantic information. In this patent it refers to cutting word-sequence data of Asian languages (Simplified Chinese, Traditional Chinese, Korean and Japanese) into individual words, with a space used as the separator between words.
The word segmentation of mixed corpora involves expertise in two respects: on the one hand, unifying the data formats of multiple corpora at the character level; on the other hand, sequence labeling in natural language processing, i.e., taking a sequence as input and training a model to produce the correct output for each fragment of the sequence.
For multilingual word segmentation, the traditional pipeline is: multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation.
Language detection first requires choosing a detection granularity. Detection at document level becomes inaccurate as soon as one document contains two or more languages: only one language is then handled and the others are ignored. A finer-grained division is therefore needed, running language detection per paragraph or per sentence. The mixed-corpus segmentation of the present invention simplifies the traditional multilingual pipeline to: multilingual input text --> word segmentation, thus avoiding the steps of paragraph splitting, sentence splitting and language detection.
Meanwhile the method for hybrid language participle involved in the present invention, its application scenarios also include:
1. the full-text index in multilingual search engine:An important function is exactly to do the full text of document in search engine Index, its content is to be segmented word, and the word segmentation result of document and document then are formed into an inverted index, Yong Hu It is also first to be segmented the read statement of inquiry when inquiry, then carries out the result of participle and index data base pair Than so as to find out the document the most similar to currently inputting.
2. multilingual autoabstract generation:Autoabstract refers to a longer document with one section of shorter spoken and written languages Go to summarize.And during summary, it is necessary to keyword in a document is calculated, therefore must be first before keyword is calculated Word segmentation processing is done to document.
3. multilingual automatic Proofreading:Automatic Proofreading refers to the inspection for making syntax error to passage, its granularity checked Or the inspection that word-based mode is done, it is therefore desirable to which the continuous word that user inputs is done into word segmentation processing.
The steps of the traditional segmentation method for mixed text containing multiple languages are:
multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation
The segmentation of each individual language can use two approaches: dictionary-based segmentation and statistics-based segmentation. Dictionary-based segmentation collects all possible words into a dictionary and then cuts the text against the dictionary vocabulary by forward maximum matching or forward minimum matching. The statistics-based approach works roughly as follows: count the frequency with which adjacent characters co-occur; if the frequency exceeds a given threshold, the characters are taken as a fixed collocation and treated as one segmentation unit. The shortcomings of these methods are:
Shortcoming 1: the detection granularity for multiple languages is hard to get right, and segmentation precision is lost whenever some language goes undetected. A document containing multiple languages must first be split into paragraphs and the language of each paragraph detected; if a paragraph itself contains multiple languages, it must be split into sentences; and if a sentence contains multiple languages, no further split can be made. Because the segmentation model depends heavily on the corpus language, information is lost whenever a language is not detected. The present invention breaks this heavy dependence of the segmentation model on the corpus: the corpora of multiple languages are mixed, and a single unified model is trained on the mixed corpus.
Shortcoming 2: dictionary-based methods rely too heavily on the dictionary and cannot use semantic information to recognize out-of-vocabulary words that do not appear in the dictionary.
Shortcoming 3: current statistics-based approaches are mainly the HMM (Hidden Markov) model and the CRF (conditional random field) model. To keep computation tractable, they consider the current character to be associated only with the previous character, while the remaining characters are conditionally independent of one another. This is inconsistent with reality, so their segmentation precision still has room for improvement. The present invention therefore introduces the Bi-LSTM method. On the one hand it is a statistics-based method, which breaks the heavy dependence of dictionary-based methods on the dictionary; on the other hand, Bi-LSTM is a long short-term memory model, which drops the conventional statistical assumption that the current character is related only to the previous one: the current character is related to all the characters of the current sentence, thereby absorbing more semantic information.
Content of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a mixed-corpus word segmentation method based on Bi-LSTM. The core of the invention comprises two parts:
Part 1: unifying the format of the multilingual mixed corpus
To avoid the language-detection step that multilingual segmentation otherwise requires, the present invention proposes a character-level segmentation method and puts the mixed corpus of multiple languages together into the deep learning model for training.
Part 2: improving the precision of multilingual segmentation
To improve the precision of multilingual segmentation and overcome the CRF (conditional random field) assumption that the current character is correlated only with the previous one, we introduce the long short-term memory neural network (LSTM) of deep learning, so that the current character and all preceding characters in the sentence are correlated with each other.
The technical scheme of the invention is as follows:
A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) converting the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) counting the characters of the mixed-corpus data Corpus_by_Char to obtain a character set CharSet, and numbering each character in the character set CharSet to obtain the character-ID set CharID corresponding to CharSet; counting the labels of the characters in the mixed-corpus data Corpus_by_Char to obtain a label set LabelSet, and numbering the labels of the label set LabelSet to obtain the corresponding label-ID set LabelID;
3) splitting the mixed-corpus data Corpus_by_Char into sentences according to sentence length, then grouping the resulting sentences by sentence length to obtain a data set GroupData containing n sentence groups;
4) randomly choosing, without replacement, a sentence group from the data set GroupData and extracting BatchSize sentences from it; the characters of each sentence form a data item w, and the label sequence corresponding to the characters of the sentence is y; converting the data w into the corresponding IDs according to the character-ID set CharID to obtain the data BatchData; converting the labels in y into the corresponding IDs according to the label-ID set LabelID to obtain the data y_ID;
5) feeding the multiple data items BatchData generated in step 4) together with the corresponding label data y_ID into the deep learning model Bi-LSTM and training the parameters of the deep learning model; if the loss Cost(y′, y_ID) produced by the deep learning model satisfies the set condition or the maximum number of iterations N is reached, terminating the training to obtain the trained deep learning model Bi-LSTM; otherwise regenerating the data BatchData by the method of step 4) and continuing to train the deep learning model Bi-LSTM;
6) converting the mixed-corpus data PreData to be predicted into data PreMData matching the deep learning model Bi-LSTM, and feeding it into the trained deep learning model Bi-LSTM to obtain the word segmentation result OrgResult.
Further, the length of the data BatchData is a fixed length maxLen; when the extracted sentence length l < maxLen, maxLen − l zeros are appended to the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain the data y_ID; here maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep learning model Bi-LSTM.
Further, the loss Cost(y′, y_ID) is produced by the following method:
31) vectorize the data BatchData in the Embedding layer of the deep learning model Bi-LSTM, converting each character in the data BatchData into a vector;
32) pass the vectors corresponding to each data item BatchData into the Bi-LSTM layer of the deep learning model Bi-LSTM, where the vector of each character in the data BatchData is passed to one forward and one backward LSTM unit of the Bi-LSTM layer respectively; the output result of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output result of the (i−1)-th backward LSTM unit is fed into the i-th backward LSTM unit;
33) pass the output of each forward LSTM unit and the output of each backward LSTM unit into the Concatenate layer, which stitches the forward and backward output results together into h_i, and then into the DropOut layer;
34) process the output of the DropOut layer with the SoftMax layer, and compute the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
Further, the loss Cost(y′, y_ID) = − y_ID log(y′) − (1 − y_ID) log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
Further, the set condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
Further, in step 3), sentences with |l_i − l_j| < δ are put into the same group, where l_i denotes the length of the i-th sentence, l_j the length of the j-th sentence, and δ the sentence-length interval.
Further, in step 1), each labeled word in the training mixed-corpus data Original_Corpus is split at character level using the BMESO tagging scheme: if the label of a word is Label, the first character of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the last character is tagged Label_E, and a word consisting of a single character is tagged Label_S.
Further, the parameters of the deep learning model Bi-LSTM are trained using the Adam gradient descent algorithm.
The flow of the method of the invention is shown in Fig. 1 and consists of two stages: a training stage and a prediction stage.
(1) Training stage (left dashed box of the flow chart):
Step 1: Convert the labeled training mixed-corpus data into character-level mixed-corpus data.
Step 2: Train the deep learning model using the Adam gradient descent algorithm.
(2) Prediction stage (right dashed box of the flow chart):
Step 1: Convert the unlabeled test mixed-corpus data into character-level mixed-corpus data.
Step 2: Predict with the deep learning model trained in the training stage.
The present invention mainly has the following advantages:
The mixed-corpus segmentation method based on Bi-LSTM works at character level rather than word level, which avoids the out-of-vocabulary problem and improves precision considerably. Model training is carried out directly on the mixed corpus, without having to detect and separate each language of the mixture, finally achieving the goal of recognizing mixed corpora.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 is the architecture diagram of the deep learning model.
Embodiment
To make the above features and advantages of the present invention clearer, specific embodiments are described in detail below with reference to the accompanying drawings.
The method flow of the present invention is shown in Fig. 1 and includes:
(1) Training stage:
Step 1: Convert the original training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char. Specifically: using the BMESO (Begin, Middle, End, Single, Other) tagging scheme, split each labeled word of the original training mixed corpus at character level. If the label of a word is Label, the first character of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the last character of the word is tagged Label_E, and a word consisting of a single character is tagged Label_S. Since Bi-LSTM is essentially a neural network model, the input length must be kept fixed, so a sentence that is not long enough is padded with O (Other). For mixed-corpus segmentation, the corpora of multiple languages are labeled in this format and mixed into one corpus data set, as sketched below.
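The following is a minimal sketch of the BMES part of this tagging step, assuming plain segmentation labels {B, M, E, S}; the function and variable names are ours, not the patent's:

    def bmes_tags(word):
        """BMES tags for the characters of one word: a single-character word
        is tagged S; otherwise the first character is B, the middle
        characters are M and the last character is E."""
        if len(word) == 1:
            return ["S"]
        return ["B"] + ["M"] * (len(word) - 2) + ["E"]

    # A pre-segmented training sentence becomes a character/label sequence:
    words = ["今天", "天气", "好"]
    chars = [c for w in words for c in w]              # ['今','天','天','气','好']
    labels = [t for w in words for t in bmes_tags(w)]  # ['B','E','B','E','S']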
Step 2: Count the characters in Corpus_by_Char to obtain a character set CharSet. For example, suppose there are two words, 中国 and 中华 (both meaning China); the merged character set is {中, 华, 国}. Number each character in the character set CharSet with increasing natural numbers to obtain the character-ID set CharID corresponding to CharSet. Count the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and similarly produce the corresponding label-ID set LabelID. The label set LabelSet is generally {B, M, E, S}; the corresponding LabelID turns each label in LabelSet into a number, which is easy for the program to identify.
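A minimal sketch of this counting and numbering step; the helper name and the choice to reserve ID 0 for padding are our assumptions, consistent with the zero-padding used later:

    def build_ids(tagged_corpus):
        """tagged_corpus: list of sentences, each a list of (char, label)
        pairs. Numbers the distinct characters and labels with increasing
        natural numbers, reserving 0 for padding."""
        char_set = {c for sent in tagged_corpus for c, _ in sent}      # CharSet
        label_set = {l for sent in tagged_corpus for _, l in sent}     # LabelSet
        char_id = {c: i for i, c in enumerate(sorted(char_set), 1)}    # CharID
        label_id = {l: i for i, l in enumerate(sorted(label_set), 1)}  # LabelID
        return char_id, label_id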
Step 3: Split Corpus_by_Char into groups by sentence length. Let l_i denote the length of the i-th sentence; sentences with |l_i − l_j| < δ are put into the same group, where δ denotes the sentence-length interval. Let GroupData denote the grouped data, with n groups in total.
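One simple way to realize this grouping is to bucket sentences by the integer part of l/δ; this is our construction, the patent does not prescribe a particular bucketing rule:

    from collections import defaultdict

    def group_by_length(sentences, delta):
        """Bucket sentences by len // delta; within one bucket all lengths
        lie in [k*delta, (k+1)*delta), so any two differ by less than
        delta."""
        buckets = defaultdict(list)
        for sent in sentences:
            buckets[len(sent) // delta].append(sent)
        return list(buckets.values())  # GroupData: n groups of similar length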
Step 4: Randomly, without replacement, extract BatchSize sentences from one group of GroupData. The characters of each sentence form a data item w, and the label sequence corresponding to the characters of the sentence is y. Convert the extracted data w into the corresponding IDs via CharID to obtain fixed-length data BatchData (corresponding to w1, w2, …, wn in Fig. 2), and convert the corresponding labels y into IDs via LabelID to obtain the fixed-length data y_ID. Because sentences within the same group have similar lengths, precision improves by about 2 percentage points compared with extracting them unordered.
Step 5: Feed the multiple data items BatchData of Step 4 together with the corresponding label data y_ID into the deep learning model, producing the loss Cost(y′, y_ID). The specific formula is:

Cost(y′, y_ID) = − y_ID log(y′) − (1 − y_ID) log(1 − y′)    (Formula 1)

where y′ denotes the output of BatchData after the classification layer (SoftMax layer) of the deep learning model, corresponding to y1, y2, …, yn in Fig. 2.
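Formula 1 is the standard binary cross-entropy. A sketch of its computation with NumPy, assuming y_ID is one-hot encoded and averaging over the batch (both conventions are ours, left implicit in the patent):

    import numpy as np

    def cost(y_pred, y_id):
        """Formula 1, applied elementwise to the SoftMax outputs y' and the
        one-hot labels y_ID, then averaged over the batch."""
        eps = 1e-12  # guard against log(0)
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        return float(np.mean(-y_id * np.log(y_pred)
                             - (1.0 - y_id) * np.log(1.0 - y_pred)))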
Step 6: Train the parameters of the deep learning model by minimizing the loss with the mini-batch gradient descent algorithm (mini_batch).
Step 7: If the loss Cost(y′, y_ID) produced by the deep learning model no longer decreases, or the maximum number of iterations N is reached, terminate the training of the deep learning model; otherwise jump to Step 4.
The criterion for "no longer decreases" is:

| Cost(y′, y_ID) − (1/M) Σ_{i=1..M} Cost_i′(y′, y_ID) | < θ    (Formula 2)

where Cost_i′(y′, y_ID) denotes the loss i iterations earlier and Cost(y′, y_ID) the loss produced by the current iteration; the formula means that if the difference between the current loss and the average of the previous M losses is less than the threshold θ, the loss is considered to no longer decrease.
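A sketch of this stopping test (the function name is ours):

    def no_longer_decreases(prev_losses, current, M, theta):
        """Formula 2: stop when |current - mean(previous M losses)| < theta."""
        if len(prev_losses) < M:
            return False
        window = prev_losses[-M:]
        return abs(current - sum(window) / M) < theta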
(2) Prediction stage:
Step 1: Convert the mixed-corpus data PreData to be predicted into the data format PreMData matching the model. Specifically: convert the mixed-corpus data to be predicted into character-level numeric data.
Step 2: Feed PreMData into the deep learning model trained in the training stage and obtain the segmentation prediction result OrgResult.
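To illustrate what obtaining the segmentation result involves, the sketch below decodes per-character BMES predictions back into space-separated words; the decoding convention is ours, the patent does not spell it out:

    def decode(chars, tags):
        """Close the current word at E or S, per the BMES scheme above."""
        words, cur = [], ""
        for ch, tag in zip(chars, tags):
            cur += ch
            if tag in ("E", "S"):
                words.append(cur)
                cur = ""
        if cur:  # tolerate a sequence that ends mid-word
            words.append(cur)
        return " ".join(words)

    # decode("今天天气好", ["B", "E", "B", "E", "S"]) -> "今天 天气 好"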
As described in Step 4 of the training stage, the extracted data are converted via CharID into several fixed-length data items BatchData, and the corresponding labels are converted via LabelID into fixed-length data y_ID. Specifically:
Step 1: Convert the extracted data w into numbers, i.e., through the correspondence between CharSet and CharID, convert each character in w into the corresponding number.
Step 2: Convert the label sequence y corresponding to the extracted data w into numbers, i.e., through the correspondence between LabelSet and LabelID, convert each label in y into the corresponding number, obtaining the data y_ID.
Step 3: Suppose the fixed length is maxLen. When the extracted sentence length l < maxLen, append maxLen − l zeros to the sentence to obtain BatchData; maxLen equals the number of forward LSTM units of the Bi-LSTM layer. Generally fewer than 5% of sentences are very long, and paying too much attention to those long sentences would reduce precision considerably (if l ≥ maxLen occurs, the simple treatment is to discard the sentence directly, or to split the long sentence into short ones for processing). Likewise append maxLen − l zeros to the data y_ID corresponding to w to obtain y_ID.
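A sketch of this padding step (our helper, with the long-sentence policy from the parenthesis above):

    def to_fixed_length(ids, max_len, pad=0):
        """Append zeros up to max_len; sentences with l >= max_len are
        discarded upstream (or split into shorter ones) as described
        above."""
        if len(ids) >= max_len:
            raise ValueError("sentence too long: discard or split it first")
        return ids + [pad] * (max_len - len(ids))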
The architecture of the deep learning model of the invention is shown in Fig. 2. As described in Step 5 of the training stage, feeding the data BatchData and its label data y_ID into the deep learning model produces the loss Cost(y′, y_ID). Specifically:
Step 1: Vectorize the incoming data BatchData in the Embedding layer, i.e., through the character-to-vector lookup table Char2Vec, convert each character in each BatchData item into the vector corresponding to its character-ID number. Each character ID has a corresponding entry in the table Char2Vec.
Step 2: Pass the vectors obtained in Step 1 into the Bi-LSTM layer. In detail: the vector w1 corresponding to the first character of each BatchData item is passed to the first forward LSTM unit of the Bi-LSTM layer, the vector w2 corresponding to the second character to the second forward LSTM unit, and so on. At the same time, the input of the i-th forward LSTM unit includes, besides the vector of the i-th character of each item, the output of the (i−1)-th forward LSTM unit, i.e., the output result of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit. Likewise, the vector corresponding to the first character of each BatchData item is passed to the first backward LSTM unit of the Bi-LSTM layer, the vector of the second character to the second backward LSTM unit, and so on; the input of the i-th backward LSTM unit includes, besides the vector of the i-th character of each item, the output of the (i−1)-th backward LSTM unit, i.e., the output result of the (i−1)-th backward LSTM unit is fed into the i-th backward LSTM unit. Note that each LSTM unit receives not just one vector at a time but BatchSize vectors.
Step 3: Pass the forward output and the backward output of each LSTM unit to the Concatenate layer, i.e., stitch the output results of the forward and backward LSTM units together into h_i.
Step 4: Pass the output of the Concatenate layer to the DropOut layer, i.e., randomly hide a fraction η (0 ≤ η ≤ 1) of the data in h_i so that it is not transmitted further backward.
Step 5: Pass the DropOut output to the SoftMax layer, and produce the final loss Cost(y′, y_ID) from the Softmax output y′ and the corresponding incoming label data y_ID; the specific calculation is given in Formula 1.
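Putting Steps 1 to 5 together, below is a minimal Keras sketch of such an architecture, reconstructed from Fig. 2 under our own assumptions (the patent gives no code; the layer sizes, the sparse-label convention and the padding ID 0 are placeholders):

    from tensorflow.keras import layers, models

    def build_bilstm_segmenter(vocab_size, num_labels, max_len,
                               embed_dim=128, lstm_units=100, eta=0.2):
        """Embedding -> bidirectional LSTM (forward and backward outputs
        concatenated into h_i) -> DropOut -> per-character SoftMax."""
        inputs = layers.Input(shape=(max_len,), dtype="int32")  # character IDs, 0 = padding
        x = layers.Embedding(vocab_size + 1, embed_dim, mask_zero=True)(inputs)
        x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True),
                                 merge_mode="concat")(x)        # Concatenate layer
        x = layers.Dropout(eta)(x)                              # active in training only
        outputs = layers.Dense(num_labels + 1, activation="softmax")(x)
        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam",                         # Adam, as in the patent
                      loss="sparse_categorical_crossentropy")
        return model

Here sparse categorical cross-entropy generalizes Formula 1 to the multi-label case, and Keras disables the Dropout layer automatically at prediction time, matching the prediction-stage remark below.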
The deep learning model in Step 2 of the prediction stage is exactly the model trained in the training stage, except that during prediction the DropOut layer hides nothing (the hidden fraction η of the notation above is set to 0), so that all data are passed on to the next layer.
The above embodiments are merely illustrative of the technical solution of the present invention and not restrictive. Those of ordinary skill in the art may modify the technical solution or substitute equivalents without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) converting the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) counting the characters of the mixed-corpus data Corpus_by_Char to obtain a character set CharSet, and numbering each character in the character set CharSet to obtain the character-ID set CharID corresponding to the character set CharSet; counting the labels of the characters in the mixed-corpus data Corpus_by_Char to obtain a label set LabelSet, and numbering the labels of the label set LabelSet to obtain the corresponding label-ID set LabelID;
3) splitting the mixed-corpus data Corpus_by_Char into sentences according to sentence length, and then grouping the resulting sentences by sentence length to obtain a data set GroupData containing n sentence groups;
4) randomly choosing, without replacement, a sentence group from the data set GroupData and extracting BatchSize sentences from the sentence group, wherein the characters of each sentence form a data item w and the label sequence corresponding to the characters of the sentence is y; converting the data w into the corresponding IDs according to the character-ID set CharID to obtain the data BatchData; converting the labels in y into the corresponding IDs according to the label-ID set LabelID to obtain the data y_ID;
5) feeding the multiple data items BatchData generated in step 4) together with the corresponding label data y_ID into the deep learning model Bi-LSTM and training the parameters of the deep learning model Bi-LSTM; if the loss Cost(y′, y_ID) produced by the deep learning model satisfies the set condition or the maximum number of iterations N is reached, terminating the training of the deep learning model to obtain the trained deep learning model Bi-LSTM; otherwise regenerating the data BatchData by the method of step 4) and continuing to train the deep learning model Bi-LSTM;
6) converting the mixed-corpus data PreData to be predicted into data PreMData matching the deep learning model Bi-LSTM, and feeding it into the trained deep learning model Bi-LSTM to obtain the word segmentation result OrgResult.
2. The method of claim 1, wherein the length of the data BatchData is a fixed length maxLen; when the extracted sentence length l < maxLen, maxLen − l zeros are appended to the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain the data y_ID; wherein maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep learning model Bi-LSTM.
3. The method of claim 2, wherein the loss Cost(y′, y_ID) is produced by the following method:
31) vectorizing the data BatchData in the Embedding layer of the deep learning model Bi-LSTM, converting each character in the data BatchData into a vector;
32) passing the vectors corresponding to each data item BatchData into the Bi-LSTM layer of the deep learning model Bi-LSTM, wherein the vector of each character in the data BatchData is passed to one forward and one backward LSTM unit of the Bi-LSTM layer respectively; the output result of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output result of the (i−1)-th backward LSTM unit is fed into the i-th backward LSTM unit;
33) passing the output of each forward LSTM unit and the output of each backward LSTM unit into the Concatenate layer, which stitches the forward and backward output results together into h_i, and then into the DropOut layer;
34) processing the output of the DropOut layer with the SoftMax layer, and computing the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
4. The method of claim 3, wherein the loss Cost(y′, y_ID) = − y_ID log(y′) − (1 − y_ID) log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
5. The method of claim 1, wherein the set condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
6. The method of claim 1, wherein in step 3) sentences with |l_i − l_j| < δ are put into the same group, where l_i denotes the length of the i-th sentence, l_j denotes the length of the j-th sentence, and δ denotes the sentence-length interval.
7. The method of claim 1, wherein in step 1) each labeled word in the training mixed-corpus data Original_Corpus is split at character level using the BMESO tagging scheme: if the label of a word is Label, the first character of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the last character of the word is tagged Label_E, and a word consisting of a single character is tagged Label_S.
8. The method of claim 1, wherein the parameters of the deep learning model Bi-LSTM are trained using the Adam gradient descent algorithm.
CN201710946891.3A 2017-10-12 2017-10-12 A mixed-corpus word segmentation method based on Bi-LSTM Withdrawn CN107894976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946891.3A CN107894976A (en) 2017-10-12 2017-10-12 A mixed-corpus word segmentation method based on Bi-LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710946891.3A CN107894976A (en) 2017-10-12 2017-10-12 A mixed-corpus word segmentation method based on Bi-LSTM

Publications (1)

Publication Number Publication Date
CN107894976A true CN107894976A (en) 2018-04-10

Family

ID=61802544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946891.3A Withdrawn CN107894976A (en) 2017-10-12 2017-10-12 A kind of mixing language material segmenting method based on Bi LSTM

Country Status (1)

Country Link
CN (1) CN107894976A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160086280A1 (en) * 2013-06-02 2016-03-24 Data Scientist Corp. Evaluation method, evaluation device, and program
CN105243055A (en) * 2015-09-28 2016-01-13 北京橙鑫数据科技有限公司 Multi-language based word segmentation method and apparatus
CN105740226A (en) * 2016-01-15 2016-07-06 南京大学 Method for implementing Chinese segmentation by using tree neural network and bilateral neural network
CN105912533A (en) * 2016-04-12 2016-08-31 苏州大学 Method and device for long statement segmentation aiming at neural machine translation
CN106682411A (en) * 2016-12-22 2017-05-17 浙江大学 Method for converting physical examination diagnostic data into disease label
CN107168957A * 2017-06-12 2017-09-15 云南大学 A Chinese word segmentation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Grzegorz Chrupała: "Text segmentation with character-level text embeddings", Workshop on Deep Learning for Audio, Speech and Language Processing, ICML 2013 *
Onur Kuru et al.: "CharNER: Character-Level Named Entity Recognition", The 26th International Conference on Computational Linguistics: Technical Papers *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388806A * 2018-10-26 2019-02-26 北京布本智能科技有限公司 A Chinese word segmentation method based on deep learning and a forgetting algorithm
CN109388806B (en) * 2018-10-26 2023-06-27 北京布本智能科技有限公司 Chinese word segmentation method based on deep learning and forgetting algorithm
CN110245332A * 2019-04-22 2019-09-17 平安科技(深圳)有限公司 Chinese character encoding method and apparatus based on a bidirectional long short-term memory network model
CN110245332B (en) * 2019-04-22 2024-03-15 平安科技(深圳)有限公司 Chinese coding method and device based on bidirectional long-short-term memory network model
CN111126037A (en) * 2019-12-18 2020-05-08 昆明理工大学 Thai sentence segmentation method based on twin cyclic neural network
US11966699B2 (en) 2021-06-17 2024-04-23 International Business Machines Corporation Intent classification using non-correlated features


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180410