CN107894976A - A mixed-corpus word segmentation method based on Bi-LSTM - Google Patents
- Publication number: CN107894976A (application CN201710946891.3A)
- Authority: CN (China)
- Prior art keywords: data, lstm, character, sentence, label
- Legal status: Withdrawn (an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G06F40/284 — Lexical analysis, e.g. tokenisation or collocates (G: Physics; G06F: Electric digital data processing; G06F40/00: Handling natural language data; G06F40/20: Natural language analysis; G06F40/279: Recognition of textual entities)
- G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking (same parent classes as above)
- G06N3/08 — Learning methods (G06N: Computing arrangements based on specific computational models; G06N3/00: Computing arrangements based on biological models; G06N3/02: Neural networks)
Abstract
The invention discloses a mixed-corpus word segmentation method based on Bi-LSTM. The method is as follows: convert the training mixed-corpus data into character-level corpus data; count the characters of the corpus data to obtain a character set and number each character, yielding a character-ID set; count the character labels to obtain a label set and number the labels, yielding a label-ID set; split the corpus into sentences and group the sentences by length, yielding a data set; randomly, without replacement, choose one sentence group from the data set and extract several sentences from it, where the characters of each sentence form a datum w and the corresponding label set is y; convert w into the corresponding IDs and feed it, together with the labels y, into the model Bi-LSTM to train the parameters of the deep-learning model; convert the data to be predicted into data matching the deep-learning model, feed it into the trained model, and obtain the segmentation result.
Description
Technical field
The invention belongs to the field of computer software technology and relates to a mixed-corpus word segmentation method based on Bi-LSTM.
Background
Bi-LSTM stands for Bi-directional Long Short-Term Memory, i.e., a bidirectional long short-term memory neural network.
A mixed corpus, in this patent, refers to training or prediction data that contain corpus data in at least two languages.
Word segmentation refers to labeling a continuous input character string into a continuous label sequence according to semantic information. In this patent, it refers to cutting word-sequence data of Asian scripts (Simplified Chinese, Traditional Chinese, Korean, and Japanese) into individual words, using spaces as the separators between words.
The method of segmenting a mixed corpus involves expertise in two areas. One is unifying the data formats of multiple corpora at the character level. The other is sequence labeling in natural language processing: taking a sequence as input and training a model to produce the correct output for each element of the sequence.
For multilingual word segmentation, the traditional pipeline is: multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation.
Language detection first requires choosing the granularity of the check. If the check is done at the document level, a document containing two or more languages will be detected inaccurately, so that only one language is handled and the other is ignored. A finer-grained division is then needed: the text must be split into paragraphs or sentences before language detection. The mixed-corpus segmentation of the present invention simplifies the traditional multilingual pipeline to: multilingual input text --> word segmentation, thereby avoiding paragraph splitting, sentence splitting, and language detection.
Meanwhile, the application scenarios of the mixed-language segmentation method of the present invention also include:
1. Full-text indexing in multilingual search engines. An important function of a search engine is full-text indexing of documents: the document content is segmented into words, and the documents together with their segmentation results form an inverted index. A user's query is likewise segmented first, and the segmentation result is compared against the index database to find the documents most similar to the current input.
2. Multilingual automatic summarization. Automatic summarization condenses a longer document into a shorter passage of text. During summarization the keywords of the document must be computed, so the document must first be segmented before the keywords can be calculated.
3. Multilingual automatic proofreading. Automatic proofreading checks a passage for grammatical errors at the granularity of words, so the continuous text entered by the user must first be segmented.
The steps of the traditional segmentation method for mixed text containing multiple languages are: multilingual input text --> (paragraph or sentence splitting) language detection --> word segmentation.
For each language, segmentation can be dictionary-based or statistics-based. Dictionary-based segmentation collects all possible words into a dictionary and then cuts the text against the dictionary vocabulary by forward maximum matching or forward minimum matching. Statistics-based segmentation counts the frequency with which adjacent characters co-occur; if the frequency exceeds a given threshold, the characters are taken as a habitual collocation and treated as one segmentation unit. The shortcomings of these methods are:
Shortcoming 1: detection granularity is hard to choose for multilingual text, and segmentation precision is lost when some language goes undetected. A document containing multiple languages must first be split into paragraphs and each paragraph checked for its language type; if a paragraph still contains multiple languages, it must be split again into sentences, and a sentence containing multiple languages cannot be split any further. Because the segmentation model depends heavily on the corpus of one language, segmentation information is lost whenever a language is not detected. The present invention breaks this heavy dependence of the segmentation model on a single-language corpus: the corpora of multiple languages are mixed, and one unified model is trained on the mixed corpus.
Shortcoming 2: dictionary-based methods rely excessively on the dictionary and cannot use semantic information to recognize out-of-vocabulary words that never appeared in the dictionary.
Shortcoming 3: current statistics-based approaches are mainly HMM (hidden Markov) models and CRF (conditional random field) models. To keep computation tractable, the current word is considered to be associated only with the previous word, and the remaining words are treated as conditionally independent of one another. This assumption does not match reality, so the segmentation precision of these models still has room for improvement.
In the present invention, the Bi-LSTM method is introduced. On the one hand, it is a statistics-based method and therefore avoids the heavy dependence of dictionary-based methods on a dictionary. On the other hand, Bi-LSTM is a long short-term memory model, which breaks the conventional statistical assumption that the current word is related only to the previous word: the current word is related to all surrounding words in the sentence, so more semantic information is absorbed.
Summary of the invention
In view of the technical problems in the prior art, the object of the present invention is to provide a mixed-corpus word segmentation method based on Bi-LSTM. The core of the invention comprises two parts:
Part 1: unifying the format of the multilingual mixed corpus. To avoid the language-detection step that multilingual segmentation would otherwise require first, the present invention proposes a character-level segmentation method and puts the mixed corpus of multiple languages together into one deep-learning model for training.
Part 2: improving the precision of multilingual segmentation. To improve precision and overcome the CRF assumption that the current word is correlated only with the previous word, we introduce the long short-term memory neural network (LSTM) of deep learning, so that the current word is correlated with all preceding words in the sentence.
The technical scheme of the present invention is as follows:
A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) Convert the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) Count the characters of Corpus_by_Char to obtain a character set CharSet, and number each character in CharSet to obtain the character-ID set CharID corresponding to CharSet; count the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and number the labels in LabelSet to obtain the corresponding label-ID set LabelID;
3) Split Corpus_by_Char into sentences by sentence length; then group the resulting sentences by length to obtain a data set GroupData containing n sentence groups;
4) Randomly, without replacement, choose one sentence group from GroupData and extract BatchSize sentences from it; the characters of each sentence form a datum w, and the label set corresponding to the characters of the sentence is y; convert w into the corresponding IDs via the character-ID set CharID to obtain the data BatchData; convert the labels in y into the corresponding IDs via the label-ID set LabelID to obtain the data y_ID;
5) Feed the multiple data BatchData generated in step 4) together with the corresponding label data y_ID into the deep-learning model Bi-LSTM and train its parameters; when the loss Cost(y′, y_ID) produced by the model satisfies the stopping condition or the maximum number of iterations N is reached, terminate training and obtain the trained deep-learning model Bi-LSTM; otherwise regenerate BatchData by the method of step 4) and continue training the model;
6) Convert the mixed-corpus data PreData to be predicted into data PreMData matching the deep-learning model Bi-LSTM, feed it into the trained model, and obtain the segmentation result OrgResult.
Further, the length of BatchData is a fixed length maxLen. When the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain y_ID; here maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep-learning model Bi-LSTM.
Further, the method of producing the loss Cost(y′, y_ID) is:
31) vectorize the data BatchData in the Embedding layer of the deep-learning model Bi-LSTM, converting each character in BatchData into a vector;
32) pass the vectors corresponding to each BatchData into the Bi-LSTM layer of the model, where the vector of each character in BatchData is passed to one forward and one reverse LSTM unit of the Bi-LSTM layer; the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit;
33) pass the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units into the Concatenate layer, where they are spliced together into h_i = [h_i^f; h_i^b], and then into the DropOut layer;
34) process the output of the DropOut layer through the SoftMax layer, and compute the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
Further, the loss Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
Further, the stopping condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
Further, in step 3), sentences with |l_i − l_j| < δ are placed in one group, where l_i denotes the length of the i-th sentence, l_j denotes the length of the j-th sentence, and δ denotes the sentence-length interval.
Further, in step 1), the BMESO tagging scheme is used to cut each labeled word in the training mixed-corpus data Original_Corpus at the character level: if the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a word consisting of only a single character is tagged Label_S.
Further, the parameters of the deep-learning model Bi-LSTM-CNN are trained using the Adam gradient-descent algorithm.
The flow of the method of the invention is shown in Fig. 1 and consists of two stages: a training stage and a prediction stage.
(1) Training stage (left dashed box of the flow chart):
Step 1: convert the labeled training mixed-corpus data into character-level mixed-corpus data.
Step 2: train the deep-learning model using the Adam gradient-descent algorithm.
(2) Prediction stage (right dashed box of the flow chart):
Step 1: convert the unlabeled test mixed-corpus data into character-level mixed-corpus data.
Step 2: predict with the deep-learning model trained in the training stage.
The present invention mainly has the following advantages: the Bi-LSTM-based mixed-corpus segmentation method works at the character level rather than the word level, so it avoids the out-of-vocabulary problem and improves precision considerably. Model training is carried out directly on the mixed corpus, without detecting and separating the individual languages of the mixture, finally achieving the goal of recognizing mixed corpora.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the invention.
Fig. 2 is the architecture diagram of the deep-learning model.
Embodiments
To make the above features and advantages of the present invention clearer, specific embodiments are given below and described in detail with reference to the accompanying drawings.
The flow of the method of the present invention is shown in Fig. 1 and includes:
(1) Training stage:
Step 1: convert the original training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char. Specifically, using the BMESO (Begin, Middle, End, Single, Other) tagging scheme, each labeled word in the original training mixed corpus is cut at the character level. If the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a single-character word is tagged Label_S. Because an LSTM is essentially a neural network model, the input length must be kept fixed when sentences enter the Bi-LSTM model, so a sentence that is too short is padded with O (Other). For mixed-corpus segmentation, the corpora of multiple languages are tagged in the above format and mixed into one corpus data set.
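The character-level cutting of Step 1 can be sketched in a few lines of Python (a minimal illustration; the function names and the placeholder label `"W"` used below are ours, not the patent's):

```python
def to_char_level(words):
    """Cut a list of (word, label) pairs into (character, tag) pairs
    using the BMES part of the BMESO scheme described above."""
    chars = []
    for word, label in words:
        if len(word) == 1:                        # single-character word
            chars.append((word, label + "_S"))
        else:
            chars.append((word[0], label + "_B"))       # word beginning
            for c in word[1:-1]:
                chars.append((c, label + "_M"))         # word middle
            chars.append((word[-1], label + "_E"))      # word end
    return chars

def pad_with_other(tagged, max_len):
    """Pad a sentence shorter than the fixed input length with O (Other)."""
    return tagged + [("", "O")] * (max_len - len(tagged))
```

For example, `to_char_level([("中国", "W")])` yields `[("中", "W_B"), ("国", "W_E")]`.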
Step 2: count the characters in Corpus_by_Char to obtain a character set CharSet. For example, suppose there are two words, 中国 and 中华; the merged character set is {中, 国, 华}. Number each character in CharSet with increasing natural numbers to obtain the character-ID set CharID corresponding to CharSet. Count the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and similarly produce the corresponding ID set LabelID. LabelSet is generally {B, M, E, S}; the corresponding LabelID represents each label in LabelSet by a number, which is convenient for the program to identify.
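Step 2 amounts to building two symbol-to-ID dictionaries. A minimal sketch (names are illustrative; reserving ID 0 for padding is our assumption, not stated in the patent):

```python
def build_id_maps(char_tagged_corpus):
    """char_tagged_corpus: iterable of (character, label) pairs.
    Returns (CharID, LabelID), each mapping a symbol to a natural number."""
    char_set = sorted({c for c, _ in char_tagged_corpus})
    label_set = sorted({t for _, t in char_tagged_corpus})
    # number with increasing natural numbers; ID 0 is left for padding zeros
    char_id = {c: i + 1 for i, c in enumerate(char_set)}
    label_id = {t: i + 1 for i, t in enumerate(label_set)}
    return char_id, label_id
```

With the example above, the characters of 中国 and 中华 merge into three entries of CharID.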
Step 3: split Corpus_by_Char by sentence length. Let l_i denote the length of the i-th sentence; sentences with |l_i − l_j| < δ are placed in one group, where δ denotes the sentence-length interval. The grouped data are denoted GroupData, with n groups in total.
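One simple way to realize the |l_i − l_j| < δ grouping is to bucket sentences by length interval (an approximation we chose; the patent does not prescribe a particular grouping algorithm):

```python
def group_by_length(sentences, delta):
    """Bucket sentences into groups whose lengths fall in the same
    interval of width delta, approximating the |l_i - l_j| < delta rule."""
    groups = {}
    for s in sentences:
        groups.setdefault(len(s) // delta, []).append(s)
    return list(groups.values())   # GroupData: n groups of sentences
```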
Step 4: randomly, without replacement, extract BatchSize sentences from one group of GroupData. The characters of each sentence form a datum w, and the label set corresponding to the characters of the sentence is y. Convert the extracted w into the corresponding IDs via CharID to obtain fixed-length data BatchData (corresponding to w1, w2, …, wn in Fig. 2), and convert the corresponding labels y into IDs via LabelID to obtain fixed-length data y_ID. Because the sentence lengths within a group are close, precision improves by about 2 percentage points compared with extracting sentences at random.
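The sampling of Step 4 can be sketched as follows (the function name and the handling of groups smaller than BatchSize are our assumptions):

```python
import random

def make_batch(group_data, char_id, label_id, batch_size, rng=random):
    """Pick one sentence group at random, draw batch_size sentences from it
    without replacement, and convert characters/labels to IDs.
    Each sentence is a list of (character, label) pairs."""
    group = rng.choice(group_data)
    picked = rng.sample(group, min(batch_size, len(group)))
    batch_data = [[char_id[c] for c, _ in sent] for sent in picked]   # w -> IDs
    y_id = [[label_id[t] for _, t in sent] for sent in picked]        # y -> IDs
    return batch_data, y_id
```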
Step 5: feed the multiple data BatchData of step 4 together with the corresponding label data y_ID into the deep-learning model, producing the loss Cost(y′, y_ID). The specific formula is:
Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′)   (Formula 1)
where y′ denotes the output of BatchData after the classification layer (SoftMax layer) of the model, corresponding to y1, y2, …, yn in Fig. 2.
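Formula 1 is the standard binary cross-entropy. A simplified scalar sketch (the patent applies it per character position over the SoftMax outputs; here the predicted probabilities and 0/1 targets are plain lists, and averaging over the batch is our choice):

```python
import math

def cost(y_pred, y_true):
    """Binary cross-entropy of Formula 1, averaged over a batch of
    per-character probabilities y_pred against 0/1 targets y_true."""
    total = 0.0
    for p, t in zip(y_pred, y_true):
        total += -t * math.log(p) - (1 - t) * math.log(1 - p)
    return total / len(y_pred)
```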
Step 6: train the parameters of the deep-learning model's loss function using the mini-batch gradient-descent algorithm (mini_batch).
Step 7: if the loss Cost(y′, y_ID) produced by the deep-learning model no longer decreases, or the maximum number of iterations N is reached, terminate the training of the model; otherwise jump back to step 4.
Here, let Cost_i(y′, y_ID) denote the loss i iterations earlier and Cost(y′, y_ID) the loss of the current iteration. "No longer decreases" means that the difference between the current loss and the average of the previous M losses is less than the threshold θ:
|Cost(y′, y_ID) − (1/M)·Σ_{i=1}^{M} Cost_i(y′, y_ID)| < θ
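The stopping rule of Step 7 can be sketched as a small closure (names and the exact bookkeeping are ours):

```python
from collections import deque

def make_stopper(m, theta, max_iters):
    """Stopping rule of Step 7: stop when |current cost - mean of the
    previous m costs| < theta, or when max_iters is reached."""
    history = deque(maxlen=m)
    def should_stop(cost_value, iteration):
        if iteration >= max_iters:
            return True
        if len(history) == m:
            mean = sum(history) / m
            if abs(cost_value - mean) < theta:
                return True
        history.append(cost_value)
        return False
    return should_stop
```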
(2) Prediction stage:
Step 1: convert the mixed-corpus data PreData to be predicted into data PreMData in the format matching the model. Specifically, convert the mixed-corpus data to be predicted into character-level numerical data.
Step 2: feed PreMData into the deep-learning model trained in the training stage, and obtain the segmentation prediction result OrgResult.
In training-stage step 4, the extracted data are converted via CharID into several fixed-length data BatchData, and the corresponding labels are converted via LabelID into several fixed-length data y_ID. Specifically:
Step 1: convert the extracted data w into numbers, i.e., convert each character in w into the corresponding number through the correspondence between CharSet and CharID.
Step 2: convert the label set y corresponding to the extracted data w into numbers, i.e., convert each label character in y into the corresponding number through the correspondence between LabelSet and LabelID, obtaining the data y_ID.
Step 3: suppose the fixed length is maxLen. When the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData; maxLen equals the number of forward LSTM units of the Bi-LSTM layer. Generally, fewer than 5% of the sentences are very long, and paying too much attention to those sentences would reduce precision considerably (if l ≥ maxLen occurs, a simple treatment is to discard the sentence directly, or to split the long sentence into short sentences for processing). Likewise, maxLen − l zeros are appended to the data y_ID corresponding to w, obtaining y_ID.
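The padding rule above can be sketched as follows (this follows the patent's literal rule that sentences with l ≥ maxLen are simply discarded; the function name is ours):

```python
def pad_or_skip(id_seq, label_seq, max_len):
    """Pad a character-ID sequence and its label-ID sequence with zeros
    up to max_len; return None for over-long sentences (the 'simply
    discard' option in the text, which affects < 5% of sentences)."""
    l = len(id_seq)
    if l >= max_len:
        return None                    # drop the long sentence
    pad = [0] * (max_len - l)
    return id_seq + pad, label_seq + pad
```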
The architecture of the deep-learning model of the present invention is shown in Fig. 2. In training-stage step 5, the data BatchData and their label data y_ID are fed into the deep-learning model, producing the loss Cost(y′, y_ID). Specifically:
Step 1: vectorize the incoming data BatchData in the Embedding layer, i.e., convert each character in each BatchData into the vector corresponding to its character ID through a character-to-vector lookup table Char2Vec. Each character has a corresponding character-ID entry in the table Char2Vec.
Step 2: pass the vectors obtained in step 1 into the Bi-LSTM layer. In detail, the vector w1 of the first character of each datum BatchData is passed to the first forward LSTM unit of the Bi-LSTM layer, the vector w2 of the second character to the second forward LSTM unit, and so on. Meanwhile, the input of the i-th forward LSTM unit includes, besides the vector of the i-th character of each datum, the output of the (i−1)-th forward LSTM unit: the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit. Likewise, the vector of the first character of each datum BatchData is passed to the first reverse LSTM unit of the Bi-LSTM layer, the vector of the second character to the second reverse LSTM unit, and so on; the input of the i-th reverse LSTM unit includes, besides the vector of the i-th character, the output of the (i−1)-th reverse LSTM unit: the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit. Note that each LSTM unit receives not just one vector at a time but BatchSize of them.
Step 3: pass the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units to the Concatenate layer, i.e., splice the output results of the forward and reverse LSTM units together into h_i = [h_i^f; h_i^b].
Step 4: pass the output of the Concatenate layer to the DropOut layer, i.e., randomly retain a fraction η (0 ≤ η ≤ 1) of the components of h_i and hide the rest, so that the hidden components are not transmitted further backward.
Step 5: pass the output of DropOut to the SoftMax layer, and produce the final loss Cost(y′, y_ID) from the SoftMax output y′ and the corresponding incoming label data y_ID; see Formula 1 for the specific calculation.
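Steps 1–5 can be illustrated with a toy, pure-Python forward pass (a sketch of the wiring only, not the patent's implementation: weights are random, the Embedding lookup and DropOut layer are omitted, and the dimensions are tiny):

```python
import math
import random

def lstm_cell(x, h_prev, c_prev, W):
    """One LSTM unit: gates computed from the character vector x and the
    previous unit's output h_prev (W is a dict of gate weight rows/biases)."""
    def dot(w, v):
        return sum(wi * vi for wi, vi in zip(w, v))
    z = x + h_prev                                   # concatenated input
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    i = [sig(dot(w, z) + b) for w, b in zip(W["i"], W["bi"])]   # input gate
    f = [sig(dot(w, z) + b) for w, b in zip(W["f"], W["bf"])]   # forget gate
    o = [sig(dot(w, z) + b) for w, b in zip(W["o"], W["bo"])]   # output gate
    g = [math.tanh(dot(w, z) + b) for w, b in zip(W["g"], W["bg"])]
    c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, g)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c

def init_weights(in_dim, hidden, rng):
    mk = lambda: [[rng.uniform(-0.1, 0.1) for _ in range(in_dim + hidden)]
                  for _ in range(hidden)]
    bias = lambda: [0.0] * hidden
    return {"i": mk(), "f": mk(), "o": mk(), "g": mk(),
            "bi": bias(), "bf": bias(), "bo": bias(), "bg": bias()}

def bi_lstm(vectors, W_fwd, W_bwd, hidden):
    """Steps 2-3: a forward chain and a reverse chain of LSTM units over
    the character vectors, then the Concatenate layer h_i = [h_i^f; h_i^b]."""
    def run(vs, W):
        h, c = [0.0] * hidden, [0.0] * hidden
        outs = []
        for v in vs:
            h, c = lstm_cell(v, h, c, W)
            outs.append(h)
        return outs
    fwd = run(vectors, W_fwd)
    bwd = run(vectors[::-1], W_bwd)[::-1]            # reverse chain, re-aligned
    return [hf + hb for hf, hb in zip(fwd, bwd)]     # concatenation

def softmax(v):
    """Step 5's classification layer over one concatenated vector."""
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]
```

For instance, running `bi_lstm` over two 4-dimensional character vectors with hidden size 3 yields one 6-dimensional h_i per character.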
The deep-learning model in prediction-stage step 2 is exactly the model trained in the training stage, except that during prediction the parameter of the DropOut layer is set to η = 1, meaning that no data are hidden and everything is passed to the next layer.
The embodiments above merely illustrate rather than limit the technical scheme of the present invention. Those of ordinary skill in the art may modify the technical scheme or make equivalent substitutions without departing from the spirit and scope of the present invention, whose protection scope shall be defined by the claims.
Claims (8)
1. A mixed-corpus word segmentation method based on Bi-LSTM, the steps of which include:
1) converting the training mixed-corpus data Original_Corpus into character-level mixed-corpus data Corpus_by_Char;
2) counting the characters of Corpus_by_Char to obtain a character set CharSet, and numbering each character in CharSet to obtain the character-ID set CharID corresponding to CharSet; counting the labels of the characters in Corpus_by_Char to obtain a label set LabelSet, and numbering the labels in LabelSet to obtain the corresponding label-ID set LabelID;
3) splitting Corpus_by_Char into sentences by sentence length, then grouping the resulting sentences by length to obtain a data set GroupData containing n sentence groups;
4) randomly, without replacement, choosing one sentence group from GroupData and extracting BatchSize sentences from it, the characters of each sentence forming a datum w and the label set corresponding to the characters of the sentence being y; converting w into the corresponding IDs via the character-ID set CharID to obtain the data BatchData; converting the labels in y into the corresponding IDs via the label-ID set LabelID to obtain the data y_ID;
5) feeding the multiple data BatchData generated in step 4) together with the corresponding label data y_ID into the deep-learning model Bi-LSTM and training the parameters of the model; when the loss Cost(y′, y_ID) produced by the model satisfies the stopping condition or the maximum number of iterations N is reached, terminating the training and obtaining the trained deep-learning model Bi-LSTM; otherwise regenerating BatchData by the method of step 4) and continuing to train the model Bi-LSTM;
6) converting the mixed-corpus data PreData to be predicted into data PreMData matching the deep-learning model Bi-LSTM, feeding it into the trained model, and obtaining the segmentation result OrgResult.
2. The method of claim 1, wherein the length of BatchData is a fixed length maxLen; when the length l of an extracted sentence satisfies l < maxLen, maxLen − l zeros are appended after the sentence to obtain BatchData, and maxLen − l zeros are appended to the corresponding data y_ID to obtain y_ID; wherein maxLen equals the number of forward LSTM units of the Bi-LSTM layer in the deep-learning model Bi-LSTM.
3. The method of claim 2, wherein the method of producing the loss Cost(y′, y_ID) is:
31) vectorizing the data BatchData in the Embedding layer of the deep-learning model Bi-LSTM, converting each character in BatchData into a vector;
32) passing the vectors corresponding to each BatchData into the Bi-LSTM layer of the model, wherein the vector of each character in BatchData is passed to one forward and one reverse LSTM unit of the Bi-LSTM layer; the output of the (i−1)-th forward LSTM unit is fed into the i-th forward LSTM unit, and the output of the (i−1)-th reverse LSTM unit is fed into the i-th reverse LSTM unit;
33) passing the outputs h_i^f and h_i^b of each pair of forward and reverse LSTM units into the Concatenate layer, splicing them together into h_i = [h_i^f; h_i^b], and passing the result into the DropOut layer;
34) processing the output of the DropOut layer through the SoftMax layer, and computing the loss Cost(y′, y_ID) from the resulting output y′ and the incoming data y_ID.
4. The method of claim 3, wherein the loss Cost(y′, y_ID) = −y_ID·log(y′) − (1 − y_ID)·log(1 − y′), where y′ denotes the output of the data BatchData after the SoftMax layer.
5. The method of claim 1, wherein the stopping condition is: the difference between the currently computed loss Cost(y′, y_ID) and the average of the previous m losses is less than a threshold θ.
6. The method of claim 1, wherein in step 3) sentences with |l_i − l_j| < δ are placed in one group, where l_i denotes the length of the i-th sentence, l_j denotes the length of the j-th sentence, and δ denotes the sentence-length interval.
7. The method of claim 1, wherein in step 1) the BMESO tagging scheme is used to cut each labeled word in the training mixed-corpus data Original_Corpus at the character level: if the label of a word is Label, the character at the beginning of the word is tagged Label_B, the characters in the middle of the word are tagged Label_M, the character at the end of the word is tagged Label_E, and a word consisting of only a single character is tagged Label_S.
8. The method of claim 1, wherein the parameters of the deep-learning model Bi-LSTM-CNN are trained using the Adam gradient-descent algorithm.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201710946891.3A | 2017-10-12 | 2017-10-12 | A mixed-corpus word segmentation method based on Bi-LSTM (published as CN107894976A)
Publications (1)
Publication Number | Publication Date
---|---
CN107894976A (en) | 2018-04-10
Family
- Family ID: 61802544
- Family Applications (1): CN201710946891.3A, filed 2017-10-12, status Withdrawn
- Country Status (1): CN — CN107894976A (en)
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243055A (en) * | 2015-09-28 | 2016-01-13 | 北京橙鑫数据科技有限公司 | Multi-language based word segmentation method and apparatus |
US20160086280A1 (en) * | 2013-06-02 | 2016-03-24 | Data Scientist Corp. | Evaluation method, evaluation device, and program |
CN105740226A (en) * | 2016-01-15 | 2016-07-06 | 南京大学 | Method for implementing Chinese segmentation by using tree neural network and bilateral neural network |
CN105912533A (en) * | 2016-04-12 | 2016-08-31 | 苏州大学 | Method and device for long statement segmentation aiming at neural machine translation |
CN106682411A (en) * | 2016-12-22 | 2017-05-17 | 浙江大学 | Method for converting physical examination diagnostic data into disease label |
CN107168957A (en) * | 2017-06-12 | 2017-09-15 | 云南大学 | A kind of Chinese word cutting method |
- 2017-10-12 CN CN201710946891.3A patent/CN107894976A/en not_active Withdrawn
Non-Patent Citations (2)
Title |
---|
GRZEGORZ CHRUPA LA: "Text segmentation with character-level text embeddings", 《WORKSHOP ON DEEP LEARNING FOR AUDIO, SPEECH AND LANGUAGE PROCESSING,ICML2013》 * |
ONUR KURU等: "CharNER: Character-Level Named Entity Recognition", 《THE 26TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS: TECHNICAL PAPERS》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109388806A (en) * | 2018-10-26 | 2019-02-26 | 北京布本智能科技有限公司 | A kind of Chinese word cutting method based on deep learning and forgetting algorithm |
CN109388806B (en) * | 2018-10-26 | 2023-06-27 | 北京布本智能科技有限公司 | Chinese word segmentation method based on deep learning and forgetting algorithm |
CN110245332A (en) * | 2019-04-22 | 2019-09-17 | 平安科技(深圳)有限公司 | Chinese character code method and apparatus based on two-way length memory network model in short-term |
CN110245332B (en) * | 2019-04-22 | 2024-03-15 | 平安科技(深圳)有限公司 | Chinese coding method and device based on bidirectional long-short-term memory network model |
CN111126037A (en) * | 2019-12-18 | 2020-05-08 | 昆明理工大学 | Thai sentence segmentation method based on twin cyclic neural network |
US11966699B2 (en) | 2021-06-17 | 2024-04-23 | International Business Machines Corporation | Intent classification using non-correlated features |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444721B (en) | Chinese text key information extraction method based on pre-training language model | |
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
CN109213861B (en) | Traveling evaluation emotion classification method combining At _ GRU neural network and emotion dictionary | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN112101028B (en) | Multi-feature bidirectional gating field expert entity extraction method and system | |
CN106570179B (en) | A kind of kernel entity recognition methods and device towards evaluation property text | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN104794169B (en) | A kind of subject terminology extraction method and system based on sequence labelling model | |
CN106096664B (en) | A kind of sentiment analysis method based on social network data | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN106844741A (en) | A kind of answer method towards specific area | |
CN110598203A (en) | Military imagination document entity information extraction method and device combined with dictionary | |
CN107045496A (en) | The error correction method and error correction device of text after speech recognition | |
CN105975454A (en) | Chinese word segmentation method and device of webpage text | |
CN107122349A (en) | A kind of feature word of text extracting method based on word2vec LDA models | |
CN107797987B (en) | Bi-LSTM-CNN-based mixed corpus named entity identification method | |
CN109284400A (en) | A kind of name entity recognition method based on Lattice LSTM and language model | |
CN107832289A (en) | A kind of name entity recognition method based on LSTM CNN | |
CN107145514B (en) | Chinese sentence pattern classification method based on decision tree and SVM mixed model | |
CN107894976A (en) | A kind of mixing language material segmenting method based on Bi LSTM | |
CN107967251A (en) | A kind of name entity recognition method based on Bi-LSTM-CNN | |
CN111444704B (en) | Network safety keyword extraction method based on deep neural network | |
CN107894975A (en) | A kind of segmenting method based on Bi LSTM | |
CN107797986A (en) | A kind of mixing language material segmenting method based on LSTM CNN | |
CN112364623A (en) | Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 2018-04-10