CN105955952A - Information extraction method based on bidirectional recurrent neural network - Google Patents
Information extraction method based on bidirectional recurrent neural network
- Publication number
- CN105955952A (application CN201610284717.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000013528 artificial neural network Methods 0.000 title claims abstract description 21
- 230000000306 recurrent effect Effects 0.000 title abstract description 16
- 238000000605 extraction Methods 0.000 title abstract description 5
- 230000002457 bidirectional effect Effects 0.000 title abstract 3
- 238000000034 method Methods 0.000 claims abstract description 50
- 230000008569 process Effects 0.000 claims abstract description 12
- 210000002569 neuron Anatomy 0.000 claims description 39
- 238000012549 training Methods 0.000 claims description 19
- 239000000284 extract Substances 0.000 claims description 11
- 238000013507 mapping Methods 0.000 claims description 9
- 238000011161 development Methods 0.000 claims description 8
- 239000000203 mixture Substances 0.000 claims description 5
- 230000001537 neural effect Effects 0.000 claims description 5
- 239000011159 matrix material Substances 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 abstract description 8
- 238000007405 data analysis Methods 0.000 abstract description 3
- 238000003058 natural language processing Methods 0.000 abstract description 2
- 230000007547 defect Effects 0.000 abstract 1
- 230000000875 corresponding effect Effects 0.000 description 14
- 238000012545 processing Methods 0.000 description 9
- 230000011218 segmentation Effects 0.000 description 9
- 230000018109 developmental process Effects 0.000 description 7
- 238000004458 analytical method Methods 0.000 description 4
- 238000003672 processing method Methods 0.000 description 4
- 238000013473 artificial intelligence Methods 0.000 description 3
- 238000005457 optimization Methods 0.000 description 3
- 238000010586 diagram Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000008520 organization Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000006855 networking Effects 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to the field of natural language processing, and in particular to an information extraction method based on a bidirectional recurrent neural network. The method applies bidirectional recurrent neural network technology: it automatically learns features of the basic elements of a text, such as characters, words, and punctuation marks, and then performs sequence modeling with a recurrent neural network (RNN), overcoming the drawback of traditional approaches in which features must be designed by hand. Moreover, because the invention uses an RNN with bidirectional propagation, it overcomes the information asymmetry of unidirectional recurrent neural networks during prediction, so that the classification of the natural-language sequence to be recognized depends on both the preceding and the following information. The method is particularly suitable for extracting entity names in big-data analysis, where it has significant application value.
Description
Technical field

The present invention relates to the field of natural language processing, and in particular to an information extraction method based on a bidirectional recurrent neural network.
Background technology

With the rapid development of the Internet, a huge volume of public web data has been created, giving rise to new industries built on big-data technology, such as Internet healthcare, Internet education, and enterprise or personal credit investigation. The rise and prosperity of these Internet industries depends on the analysis of large amounts of data. Most of the data obtained directly from web pages, however, is unstructured; before such data can be used, data cleansing consumes a great deal of time and effort at every major company. Within data cleansing, the extraction of specific information, and in particular the extraction of named entities, is a recurring task. In enterprise credit investigation, for example, the most common task is to extract enterprise names from long passages of text.

Besides names that follow the common pattern "province/city + keyword + industry + organization type", there are many exceptions: an enterprise name may not begin with a province or city, or may appear in abbreviated or shortened form in informal text. This directly causes the recall of information analysis performed in the traditional way to be low. The traditional natural-language-processing approach models text as a sequence with a conditional random field (CRF) and then analyzes the text to recognize enterprise names. To use a CRF, feature templates must first be designed and constructed according to the characteristics of the entities to be recognized. A feature template includes state features such as first-order words or higher-order phrases within a context window of a specified size, together with word prefixes, suffixes, and part-of-speech tags. Building feature templates is time-consuming and laborious, yet the recognition result depends heavily on them, and hand-crafted templates usually fit only part of the samples and generalize poorly. Moreover, a CRF can typically use only local context, and the individual feature templates are independent of one another, so a prediction can neither depend on long-range historical state nor use information further ahead to correct possible earlier mistakes; the prediction process is complicated, and the result is hard to make globally optimal.

For high-quality extraction of enterprise names, it is therefore of great value to study a method that discovers enterprise names through automated learning.
Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art and to provide an information extraction method based on a bidirectional recurrent neural network. A bidirectional recurrent neural network is used to predict the enterprise-entity names in a text. When predicting an enterprise-entity name, the method relies on both the preceding and the following context, so the prediction approaches a global optimum and recognition is more reliable. Moreover, with the bidirectional recurrent neural network doing the processing, no feature templates need to be designed by hand, which saves labor and generalizes better: enterprise names can be discovered and extracted from texts of all kinds, and the recall of recognition is significantly higher than with traditional rule-based processing methods.

To achieve the above objective, the invention provides the following technical solution:

An information extraction method based on a bidirectional recurrent neural network uses a bidirectional recurrent neural network to recognize the enterprise-entity names in the text to be analyzed, and comprises the following implementation steps:
(1) Select documents containing enterprise-entity names and annotate them manually: mark the segments of each enterprise-entity name as beginning, middle part, and end, and mark the words outside enterprise-entity names as irrelevant;

(2) Input the word sequences of the manually annotated training samples into the bidirectional recurrent neural network, first forward and then in reverse, to train the network; the bidirectional recurrent neural network uses the following forward-pass formulas:
$$a_{\overrightarrow{h}}^{t} = \sum_{i=1}^{I} w_{i\overrightarrow{h}}\, x_i^t + \sum_{\overrightarrow{h}'=1}^{H} w_{\overrightarrow{h}'\overrightarrow{h}}\, b_{\overrightarrow{h}'}^{t-1}, \qquad b_{\overrightarrow{h}}^{t} = \theta\left(a_{\overrightarrow{h}}^{t}\right)$$

$$a_{\overleftarrow{h}}^{t} = \sum_{i=1}^{I} w_{i\overleftarrow{h}}\, x_i^t + \sum_{\overleftarrow{h}'=1}^{H} w_{\overleftarrow{h}'\overleftarrow{h}}\, b_{\overleftarrow{h}'}^{t+1}, \qquad b_{\overleftarrow{h}}^{t} = \theta\left(a_{\overleftarrow{h}}^{t}\right)$$

$$a_k^t = \sum_{\overrightarrow{h}=1}^{H} w_{\overrightarrow{h}k}\, b_{\overrightarrow{h}}^t + \sum_{\overleftarrow{h}=1}^{H} w_{\overleftarrow{h}k}\, b_{\overleftarrow{h}}^t, \qquad y_k^t = \frac{\exp(a_k^t)}{\sum_{k'=1}^{K} \exp(a_{k'}^t)}$$

Here I is the dimension of a vectorized character or word, H is the number of hidden-layer neurons, and K is the number of output-layer neurons. $x_i^t$ is the value of dimension i of the vectorized character or word at time t; $a_{\overrightarrow{h}}^t$ is the input of a hidden-layer neuron of the bidirectional recurrent neural network at time t during forward input (the word sequence fed head-to-tail into the network); $a_{\overleftarrow{h}}^t$ is the input of a hidden-layer neuron at time t during reverse input (the word sequence fed tail-to-head into the network); $b_{\overrightarrow{h}}^t$ and $b_{\overleftarrow{h}}^t$ are the outputs of the hidden-layer neurons at time t during forward and reverse input; θ(·) is the function mapping a hidden-layer neuron's input to its output; $a_k^t$ is the input of output-layer neuron k at time t (after the forward and then the reverse pass); $y_k^t$ is the output of output-layer neuron k at time t, a probability value giving the ratio of the k-th neuron's output to the sum of all K output values. The class corresponding to the neuron with the largest $y_k^t$ is the final class predicted by the bidirectional recurrent neural network for the character or word at time t.

Specifically, $b_{\overrightarrow{h}}^{0}$ and $b_{\overleftarrow{h}}^{T+1}$ are vectors whose every dimension is 0, where T is the length of the input sequence.
(3) Input the word sequence of the text to be analyzed into the bidirectional recurrent neural network; the network classifies the input word sequence, and the words corresponding to adjacent labels belonging to the enterprise-name part of the classification result are extracted together as one whole enterprise name.
Specifically, the method comprises a step of word segmentation of the text to be processed, where the text to be processed includes the annotation text (the manually annotated text) and the text to be analyzed. Segmenting the text to be processed yields the corresponding word sequence and facilitates subsequent processing.

Further, in step (1) the word sequence of the text to be annotated is labeled according to the segmentation result: each enterprise name is labeled, segment by segment, as beginning, middle part, and end, and every other token in the sequence is labeled as irrelevant.

Further, the method vectorizes the characters or words of the text sequence to be processed by constructing a dictionary mapping table.

Further, 35% of the annotated samples are chosen as development samples and 65% as training samples. During training of the bidirectional recurrent neural network, only the model with the highest recognition accuracy on the development set is retained.
Compared with the prior art, the beneficial effects of the present invention are as follows. The invention provides an information extraction method based on a bidirectional recurrent neural network, which predicts the enterprise-entity names in a text. When predicting an enterprise name, the forward pass first feeds the text sequence head-to-tail, step by step, into the bidirectional recurrent neural network, and then feeds it tail-to-head in reverse; during both the forward and the reverse input, the input signal of the network at each time step also includes the network's output signal from the previous time step. Predicting an enterprise-entity name therefore relies on both the preceding and the following context, the prediction approaches a global optimum, and recognition is more reliable. Furthermore, with the bidirectional recurrent neural network doing the processing, no feature templates need to be designed by hand, which saves labor and generalizes better: enterprise names, including their abbreviations and short forms, can be discovered and extracted from texts of all kinds, and the recall of recognition is significantly higher than with traditional rule-based processing methods. The method can discover and extract enterprise-entity names from massive Internet text and has considerable application value in the field of data analysis.
Brief description of the drawings:

Fig. 1 is a schematic diagram of the overall process of this information extraction method based on a bidirectional recurrent neural network.

Fig. 2 is a partial schematic diagram of the signal flow of this information extraction method based on a bidirectional recurrent neural network.

Fig. 3 is a schematic diagram of the signal flow of this information extraction method based on a bidirectional recurrent neural network.

Fig. 4 is a schematic diagram of the signal flow of Embodiment 1 of this information extraction method based on a bidirectional recurrent neural network.
Detailed description of the invention

The present invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following examples; all techniques realized on the basis of the present invention fall within the scope of the invention.

The present invention provides an information extraction method based on a bidirectional recurrent neural network, which uses the network to predict the enterprise-entity names in the text to be recognized. To achieve the above objective, the invention provides the following technical solution:
An information extraction method based on a bidirectional recurrent neural network recognizes the enterprise-entity names in the text to be processed through a bidirectional recurrent neural network, and comprises the implementation steps shown in Fig. 1:

(1) Select 3000 texts containing enterprise-entity names and annotate them manually: mark the segments of each enterprise-entity name as beginning, middle part, and end, and mark the words outside enterprise-entity names as irrelevant. Specifically, the beginning of an enterprise or organization name in the text is labeled B, a middle part is labeled M, and the end is labeled E; all other characters that do not belong to an enterprise or organization are labeled N. Labeling the word sequence with letters or digits is simple and easy to handle, and is convenient for subsequent operations on the sequence.
(2) Input the word sequences of the manually labeled training samples into the bidirectional recurrent neural network, first forward and then in reverse, to train the network. (Forward input means that the characters or words of the sequence are fed, in the order of their positions, into the recurrent neural network at the corresponding time steps; reverse input means that they are fed in inverted order.) The input signal of the bidirectional recurrent neural network at each time step also includes the network's output signal from the previous time step; once both the forward and the reverse input are finished, the recursion stops.
(3) Input the word sequence of the document to be analyzed into the bidirectional recurrent neural network. The network classifies the input word sequence, identifying the type (N, B, M, or E) of each token, and the words corresponding to a B M E sequence lying between two adjacent N labels in the classification result are extracted together as one whole enterprise name.
Further, the method comprises a word-segmentation step for the text to be processed (the text to be processed includes the annotation text and the text to be analyzed). Segmenting the text yields the corresponding word sequence. Many segmentation tools are currently available, such as the Stanford segmenter, ICTCLAS, the PanGu segmenter, and the Paoding segmenter. Segmentation decomposes longer text into relatively independent word units, discretizing and serializing the content to be processed and providing the basis for applying the recurrent neural network.
Further, in step (1) the enterprise-entity names in the training samples are annotated according to the result of the word segmentation.
Further, in order to recognize the abbreviations and short forms of enterprises in informal text, a corresponding portion of the annotated samples (one third) containing such abbreviations and short forms can be selected and labeled. For example: "On the night of March 9, XXYY Group Company issued a bulletin stating that, with its wholly-owned subsidiary Hong Kong XXYY Co., Ltd. as the investment subject, it intends to contribute 3 million dollars to establish XX Artificial Intelligence Technology Company jointly with other parties; the total share capital is 100 million shares, with Hong Kong XX holding 15%." After word segmentation this becomes: "March 9/night/,/XX/YY/Group Company/issued/bulletin/stating/,/intends/with/wholly-owned/subsidiary/Hong Kong/XX/YY/Co., Ltd./as/investment/subject/,/contribute/3 million dollars/with/other parties/jointly/establish/XX/artificial intelligence/technology/company/,/total/share capital/100 million shares/,/Hong Kong/XX/holding/15%/." Here "XX/YY/Group Company" is labeled "MME", "Hong Kong/XX/YY/Co., Ltd." is labeled "BMME", "XX/artificial intelligence/technology/company" is labeled "BMME", and "Hong Kong/XX" is labeled "BM"; every other character or word is labeled N. Annotation text of this kind contains both full enterprise names and enterprise abbreviations. 1000 such samples are annotated and used to train the bidirectional recurrent neural network, after which the trained network can recognize full enterprise names and abbreviations of similar structure.
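Represented as data, an annotated sample of this kind pairs each segmented token with its tag. The following minimal sketch (tokens abbreviated from the example above; the integer class-id assignment is our assumption) shows a full name labeled B M M E and an abbreviation without the company-type suffix labeled B M:

```python
# Tags follow the scheme above: B = beginning, M = middle, E = end, N = irrelevant.
tokens = ["Hong Kong", "XX", "YY", "Co., Ltd.", "invest", "subject", "Hong Kong", "XX", "holding", "15%"]
tags   = ["B",         "M",  "M",  "E",         "N",      "N",       "B",         "M",  "N",       "N"]

TAG2ID = {"N": 0, "B": 1, "M": 2, "E": 3}   # hypothetical id assignment for the K = 4 classes
labels = [TAG2ID[t] for t in tags]
print(labels)  # [1, 2, 2, 3, 0, 0, 1, 2, 0, 0]
```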
Specifically, in step (2) the bidirectional recurrent neural network uses the following forward-pass formulas:

$$a_{\overrightarrow{h}}^{t} = \sum_{i=1}^{I} w_{i\overrightarrow{h}}\, x_i^t + \sum_{\overrightarrow{h}'=1}^{H} w_{\overrightarrow{h}'\overrightarrow{h}}\, b_{\overrightarrow{h}'}^{t-1}, \qquad b_{\overrightarrow{h}}^{t} = \theta\left(a_{\overrightarrow{h}}^{t}\right)$$

$$a_{\overleftarrow{h}}^{t} = \sum_{i=1}^{I} w_{i\overleftarrow{h}}\, x_i^t + \sum_{\overleftarrow{h}'=1}^{H} w_{\overleftarrow{h}'\overleftarrow{h}}\, b_{\overleftarrow{h}'}^{t+1}, \qquad b_{\overleftarrow{h}}^{t} = \theta\left(a_{\overleftarrow{h}}^{t}\right)$$

$$a_k^t = \sum_{\overrightarrow{h}=1}^{H} w_{\overrightarrow{h}k}\, b_{\overrightarrow{h}}^t + \sum_{\overleftarrow{h}=1}^{H} w_{\overleftarrow{h}k}\, b_{\overleftarrow{h}}^t, \qquad y_k^t = \frac{\exp(a_k^t)}{\sum_{k'=1}^{K} \exp(a_{k'}^t)}$$

Here I is the dimension of a character or word of the word sequence after vectorization, H is the number of hidden-layer neurons, and K is the number of output-layer neurons. $x_i^t$ is the value of dimension i of the vectorized character or word at time t. $a_{\overrightarrow{h}}^t$ is the input of a hidden-layer neuron of the bidirectional recurrent neural network at time t during forward input (the word sequence fed forward into the network); in this method the time index of the network corresponds to the position index of the input word sequence, so, for example, the character or word at the third position of the sequence is the input of the bidirectional recurrent neural network at the third time step. $a_{\overleftarrow{h}}^t$ is the input of a hidden-layer neuron at time t during reverse input (the word sequence fed into the network in reverse). $b_{\overrightarrow{h}}^t$ is the output of a hidden-layer neuron at time t during forward input, $b_{\overleftarrow{h}}^t$ the output during reverse input, and θ(·) is the function mapping a hidden-layer neuron's input to its output. $a_k^t$ is the input of output-layer neuron k at time t; as the formulas show, it combines the hidden-layer output of the forward pass at time t with the hidden-layer output of the reverse pass at time t. The result of the computation propagates forward until the bidirectional recurrent neural network outputs the classification of that time step, so computing the classification of the character or word at the current time step combines both the historical and the future sequence information, relying on the context of the whole text rather than on local information, and the prediction approaches a global optimum. $y_k^t$ is the output of output-layer neuron k at time t: a probability value giving the ratio of the k-th neuron's output to the sum of the K output values; the class of the neuron with the largest $y_k^t$ is generally taken as the final prediction of the bidirectional recurrent neural network at that time step. $w_{i\overrightarrow{h}}$ is the weight of $x_i^t$ during forward input and $w_{i\overleftarrow{h}}$ its weight during reverse input; $w_{\overrightarrow{h}'\overrightarrow{h}}$ is the weight of $b_{\overrightarrow{h}'}^{t-1}$ during forward input and $w_{\overleftarrow{h}'\overleftarrow{h}}$ the weight of $b_{\overleftarrow{h}'}^{t+1}$ during reverse input; $w_{\overrightarrow{h}k}$ is the weight of $b_{\overrightarrow{h}}^t$ and $w_{\overleftarrow{h}k}$ the weight of $b_{\overleftarrow{h}}^t$.

$b_{\overrightarrow{h}}^{0}$ and $b_{\overleftarrow{h}}^{T+1}$ are vectors whose every dimension is 0, where T is the length of the input sequence.
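A minimal NumPy sketch of this forward pass is given below. It follows the formulas above directly; the choice of tanh for θ(·), the weight initialization, and all dimension values are our assumptions for illustration (biases are omitted, as in the formulas):

```python
import numpy as np

def brnn_forward(X, Wf_in, Wf_rec, Wb_in, Wb_rec, Wf_out, Wb_out, theta=np.tanh):
    """Forward pass of a bidirectional RNN per the formulas above.
    X: (T, I) vectorized word sequence; Wf_*/Wb_* are the forward/reverse
    input (I, H), recurrent (H, H), and output (H, K) weight matrices."""
    T, H = X.shape[0], Wf_in.shape[1]
    bf = np.zeros((T + 2, H))   # bf[0]   is the all-zero vector b^0
    bb = np.zeros((T + 2, H))   # bb[T+1] is the all-zero vector b^(T+1)
    for t in range(1, T + 1):               # head-to-tail (forward) pass
        bf[t] = theta(X[t - 1] @ Wf_in + bf[t - 1] @ Wf_rec)
    for t in range(T, 0, -1):               # tail-to-head (reverse) pass
        bb[t] = theta(X[t - 1] @ Wb_in + bb[t + 1] @ Wb_rec)
    A = bf[1:T + 1] @ Wf_out + bb[1:T + 1] @ Wb_out   # a_k^t, shape (T, K)
    expA = np.exp(A - A.max(axis=1, keepdims=True))   # numerically stable softmax
    return expA / expA.sum(axis=1, keepdims=True)     # y_k^t, shape (T, K)

# Example: T=5 tokens, I=8 embedding dims, H=16 hidden units, K=4 classes (N, B, M, E)
rng = np.random.default_rng(0)
T, I, H, K = 5, 8, 16, 4
Y = brnn_forward(rng.normal(size=(T, I)),
                 0.1 * rng.normal(size=(I, H)), 0.1 * rng.normal(size=(H, H)),
                 0.1 * rng.normal(size=(I, H)), 0.1 * rng.normal(size=(H, H)),
                 0.1 * rng.normal(size=(H, K)), 0.1 * rng.normal(size=(H, K)))
print(Y.argmax(axis=1))  # predicted class id for each of the T tokens
```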
According to these forward-pass formulas, the signal flow of the method is as shown in Fig. 2 and Fig. 3 (where vec-a, vec-b, vec-c, vec-d, vec-e, vec-f, vec-g, vec-h, vec-i, vec-j, vec-k, vec-l, vec-m, ..., vec-z denote row vectors of the two-dimensional matrix of the dictionary mapping table).
As the forward-pass formulas show, when the method predicts enterprise names with the bidirectional recurrent neural network, the text sequence is first fed head-to-tail, step by step, into the network in the forward direction, and then fed tail-to-head in the reverse direction. During the forward and reverse passes, the input signal of the network at each time step includes the vectorized character or word of that step together with the network's output from the previous step, and only during the reverse pass does the network output the classification of the character or word at each step. Predicting an enterprise-entity name therefore relies on both the preceding and the following context, the prediction approaches a global optimum, and recognition is more reliable. Moreover, with the bidirectional recurrent neural network doing the processing, no feature templates need to be designed by hand, which saves labor and generalizes better: enterprise names can be discovered and extracted from texts of all kinds, and the recall of recognition is significantly higher than with traditional rule-based processing methods.
Further, the invention uses the above forward pass to propagate data through the bidirectional recurrent neural network layer by layer and obtains the recognition (prediction) at the output layer. When the prediction deviates from the annotation of the training sample, every weight in the network is adjusted by the classical error back-propagation algorithm: the error is propagated backwards layer by layer and apportioned to all the neurons of each layer, yielding an error signal for each neuron that is then used to correct its weight. Propagating data forward with the forward pass and gradually correcting the neuron weights with the backward pass is exactly the training process of the neural network. The process is repeated until the accuracy of the prediction reaches a set threshold, at which point training stops and the bidirectional recurrent neural network model is considered trained.
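The patent does not name the objective being minimized; assuming the usual per-token cross-entropy between $y_k^t$ and the annotated class, a minimal sketch of the quantity whose error is back-propagated:

```python
import numpy as np

def sequence_loss(Y, labels):
    """Mean cross-entropy between the predicted y_k^t (shape (T, K)) and the
    annotated class ids; its gradients are what back-propagation distributes
    over the neurons of each layer to correct the weights."""
    T = len(labels)
    return -np.log(Y[np.arange(T), labels]).mean()
```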
Further, in step (3), the words corresponding to B M…E within N B M…E N, to B M… within N B M…N, and to M…E within N M…E N in the classification result of the bidirectional recurrent neural network are extracted as whole enterprise names, where M… is a sequence of at least one M; this completes the judgment and extraction of enterprise names. The method can therefore not only recognize enterprise names that follow the naming rule (B M…E) but also discover enterprise abbreviations in informal text (B M…, M…E). For example, an enterprise named "Beijing XXXX Co., Ltd." in one document may appear in informal text simply as "Beijing XXXX", omitting the key suffix that traditional enterprise-name extraction relies on ("enterprise", "company", "group", and so on); the method can still extract such abbreviations and short forms (B M…, M…E), greatly improving the recall of enterprise-name discovery and alleviating the problems of incomplete extraction and missed detections.
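A minimal sketch of this extraction rule (the function name and the use of a regular expression are our choices): scan the predicted tag string for runs matching B M…E, B M…, or M…E, each with at least one M, bounded by N tags or by the sequence ends:

```python
import re

def extract_names(tokens, tags):
    """Join the tokens covered by B M..E / B M.. / M..E runs, as described above."""
    s = "".join(tags)                                 # e.g. "BMMENNBMN"
    pattern = r"BM+E|BM+(?=N|$)|(?:(?<=N)|^)M+E"      # full name | no suffix | no beginning
    return ["".join(tokens[m.start():m.end()]) for m in re.finditer(pattern, s)]

tokens = ["成都", "AB", "电子", "有限公司", "发布", "公告", "香港", "XX", "。"]
tags   = ["B",   "M",  "M",   "E",        "N",   "N",   "B",   "M",  "N"]
print(extract_names(tokens, tags))  # ['成都AB电子有限公司', '香港XX']
```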
Further, the method vectorizes the characters or words of the text to be processed through a dictionary mapping table. The dictionary mapping table is a two-dimensional matrix; each of its row vectors corresponds to one character or word, and the correspondence between row vectors and words is fixed when the table is built.
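A minimal sketch of such a dictionary mapping table (the vocabulary, the vector dimension I = 8, and the random initialization are assumptions; only the row-lookup structure is from the description above):

```python
import numpy as np

vocab = ["成都", "AB", "电子", "有限公司", "发布", "公告", "<UNK>"]   # hypothetical vocabulary
word2row = {w: r for r, w in enumerate(vocab)}    # word-to-row mapping, fixed at build time
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8))          # the 2-D matrix: one row vector per word

def vectorize(tokens):
    """Map a segmented word sequence to its (T, I) sequence of row vectors."""
    rows = [word2row.get(w, word2row["<UNK>"]) for w in tokens]
    return table[rows]

X = vectorize(["成都", "AB", "电子", "有限公司"])
print(X.shape)   # (4, 8): ready to feed into the bidirectional recurrent network
```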
Further, 35% of the manually annotated samples are randomly chosen as development samples and 65% as training samples. During training of the bidirectional recurrent neural network, only the model with the highest recognition accuracy on the development set is retained, which prevents overfitting and keeps the training moving in a reasonable direction. Because the development samples use the same labeling standard as the training samples, irrelevant complexity is reduced and the result of verification on the development set is more reliable.
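A minimal sketch of this 35%/65% split and best-on-development model selection; the `train_one_epoch` and `dev_accuracy` callables stand in for the training and evaluation routines described above and are assumptions:

```python
import copy
import random

def select_best_model(samples, model, train_one_epoch, dev_accuracy, epochs=50):
    """Randomly split the annotated samples 35% development / 65% training, and
    keep only the model with the highest recognition accuracy on the development
    set, which guards against overfitting."""
    samples = list(samples)
    random.shuffle(samples)
    n_dev = int(0.35 * len(samples))
    dev, train = samples[:n_dev], samples[n_dev:]
    best_acc, best_model = -1.0, None
    for _ in range(epochs):
        model = train_one_epoch(model, train)       # forward pass + back-propagation
        acc = dev_accuracy(model, dev)
        if acc > best_acc:                          # retain only the best dev model
            best_acc, best_model = acc, copy.deepcopy(model)
    return best_model, best_acc
```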
Embodiment 1

Suppose the following news text has been captured from the network: "The wholly-owned subsidiary of Chengdu AB Holding Group Co., Ltd., Chengdu AB Electronics Co., Ltd., intends, jointly with Chengdu CDEF Technology Co., Ltd. and two natural-person investors, to establish Chengdu ABEF Big Data Financial Services Co., Ltd., which will provide commercial big-data solutions for financial services to financial institutions, mainly banks." Segmenting this text with a segmenter gives: "Chengdu/A/B/Holding/Group/Joint-Stock/Co., Ltd./of/wholly-owned/subsidiary/Chengdu/A/B/Electronics/Co., Ltd./intends/jointly/Chengdu/C/D/E/F/Technology/Co., Ltd./and/2/natural persons/invest/establish/Chengdu/A/B/E/F/Big Data/Financial/Services/Co., Ltd./,/for/mainly/bank/of/financial/institutions/provide/financial/services/of/commercial/big data/solution/.", forming a word sequence of length 55. Passing this word sequence through the preset dictionary mapping table yields the corresponding sequence of 55 vectors, which is fed in order into the trained bidirectional recurrent neural network. The network predicts and finally outputs the tag sequence BMMMMMENNNBMMMENNBMMMMMENNNNNNBMMMMMMMENNNNNNNNNNNNNNNN (the signal flow is shown in Fig. 4, where "vec-a", "vec-b", "vec-c", "vec-d", "vec-e", "vec-f", "vec-g", "vec-h", "vec-i", "vec-j", "vec-k", "vec-l", "vec-m", "vec-n", ..., "vec-z" are the corresponding row vectors of the dictionary mapping table). The words corresponding to "BMMMMME", "BMMME", "BMMMMME", and "BMMMMMMME", namely "Chengdu AB Holding Group Co., Ltd.", "Chengdu AB Electronics Co., Ltd.", "Chengdu CDEF Technology Co., Ltd.", and "Chengdu ABEF Big Data Financial Services Co., Ltd.", are extracted in turn, completing the extraction of the enterprise names in this text.
Claims (10)
1. an information extracting method based on forward-backward recutrnce neutral net, it is characterised in that use forward-backward recutrnce neutral net
Identify the enterprise dominant title in text to be analyzed.
2. the method for claim 1, it is characterised in that comprise implemented below step:
(1) choosing and have the document of enterprise dominant title as training sample, pedestrian's work of going forward side by side marks, by enterprise dominant therein
Name segment is labeled as: beginning, mid portion and latter end, is unrelated by the word marking beyond enterprise dominant title
Part;
(2) by the word sequence in the training sample of handmarking, it is neural that first forward the most reversely inputs described forward-backward recutrnce
In network, train described forward-backward recutrnce neutral net;
(3) word sequence being analysed in text, first forward the most reversely inputs the described forward-backward recutrnce neutral net trained
In, judge each word or the type of word in word sequence through forward-backward recutrnce neutral net, and the most adjacent is belonged to
The beginning of enterprise name, centre and the words corresponding to latter end extract as an entirety.
3. method as claimed in claim 2, it is characterised in that described forward-backward recutrnce neutral net uses following forward algorithm public
Formula:
I is word or the dimension of word of vectorization, and H is the neuron number of hidden layer, and K is the number of output layer neuron, its
InFor the word of t vectorization or word in the value of i-th dimension degree,Forward-backward recutrnce god described in t when inputting for forward
Through the input of the hidden layer neuron of network,Hidden layer god for forward-backward recutrnce neutral net described in t during reversely input
Through the input of unit,The output of t hidden layer neuron when inputting for forward,For t hidden layer god during reversely input
Through the output of unit, θ () is the function that hidden layer neuron is input to output,For the input of t output layer neuron,For
The output of t output layer neuron,It is a probit, represents that the output valve of kth neuron is relative to K neuron
The ratio of output valve summation.
4. method as claimed in claim 3, it is characterised in thatWithBeing each dimension values vector of being 0, wherein T is
The length of input word sequence.
5. the method as described in one of Claims 1-4, it is characterised in that comprise the process that pending text is carried out participle,
Described pending text includes marking text and text to be analyzed.
6. method as claimed in claim 5, it is characterised in that realize pending text sequence by building dictionary mapping table
Middle word or the vectorization of word, described dictionary mapping table is a matrix, the corresponding word of each row vector therein or word,
And what the corresponding relation of row vector and word or word was arranged when building described dictionary and mapping.
7. method as claimed in claim 6, it is characterised in that when carrying out data mark, by the enterprise in text to be marked
The beginning of title is labeled as B, mid portion is labeled as M, latter end is labeled as E, by the literary composition beyond enterprise dominant title
The irrelevant portions of word is labeled as N.
8. method as claimed in claim 7, it is characterised in that in described step (3), described forward-backward recutrnce neutral net is divided
N B M in class result ... E N, N B M ... N, N M ... B M in E N ... E, B M ..., M ... word corresponding for E is as enterprise name
Entirety extracts, wherein M ... be the sequence of at least 1 M composition.
9. method as claimed in claim 8, it is characterised in that choose the sample of 35% in mark text as exploitation sample
This, the sample of 65% is training sample.
10. method as claimed in claim 9, it is characterised in that only protect in described forward-backward recutrnce neural network training process
Stay the model that recognition accuracy in development set is the highest.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610284717.2A CN105955952A (en) | 2016-05-03 | 2016-05-03 | Information extraction method based on bidirectional recurrent neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610284717.2A CN105955952A (en) | 2016-05-03 | 2016-05-03 | Information extraction method based on bidirectional recurrent neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105955952A (en) | 2016-09-21 |
Family
ID=56913391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610284717.2A Pending CN105955952A (en) | 2016-05-03 | 2016-05-03 | Information extraction method based on bidirectional recurrent neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105955952A (en) |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107507052A (en) * | 2017-07-17 | 2017-12-22 | 苏州凯联信息科技有限公司 | A kind of quotation information acquisition methods and device |
CN108090045A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method for building up of marking model, segmenting method and device |
CN108154191A (en) * | 2018-01-12 | 2018-06-12 | 北京经舆典网络科技有限公司 | The recognition methods of file and picture and system |
CN108182976A (en) * | 2017-12-28 | 2018-06-19 | 西安交通大学 | A kind of clinical medicine information extracting method based on neural network |
CN108664474A (en) * | 2018-05-21 | 2018-10-16 | 众安信息技术服务有限公司 | A kind of resume analytic method based on deep learning |
CN109117795A (en) * | 2018-08-17 | 2019-01-01 | 西南大学 | Neural network expression recognition method based on graph structure |
WO2019041529A1 (en) * | 2017-08-31 | 2019-03-07 | 平安科技(深圳)有限公司 | Method, electronic apparatus, and computer readable storage medium for identifying company as subject of news report |
CN109800332A (en) * | 2018-12-04 | 2019-05-24 | 北京明略软件系统有限公司 | Method, apparatus, computer storage medium and the terminal of processing field name |
CN110020190A (en) * | 2018-07-05 | 2019-07-16 | 中国科学院信息工程研究所 | A kind of suspected threat index verification method and system based on multi-instance learning |
CN110019711A (en) * | 2017-11-27 | 2019-07-16 | 吴谨准 | A kind of control method and device of pair of medicine text data structureization processing |
JP2019531562A (en) * | 2017-02-23 | 2019-10-31 | ▲騰▼▲訊▼科技(深▲セン▼)有限公司 | Keyword extraction method, computer apparatus, and storage medium |
CN111428511A (en) * | 2020-03-12 | 2020-07-17 | 北京明略软件系统有限公司 | Event detection method and device |
CN111696674A (en) * | 2020-06-12 | 2020-09-22 | 电子科技大学 | Deep learning method and system for electronic medical record |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
CN104615983A (en) * | 2015-01-28 | 2015-05-13 | 中国科学院自动化研究所 | Behavior identification method based on recurrent neural network and human skeleton movement sequences |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks |
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9263036B1 (en) * | 2012-11-29 | 2016-02-16 | Google Inc. | System and method for speech recognition using deep recurrent neural networks |
CN104615983A (en) * | 2015-01-28 | 2015-05-13 | 中国科学院自动化研究所 | Behavior identification method based on recurrent neural network and human skeleton movement sequences |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
CN104952448A (en) * | 2015-05-04 | 2015-09-30 | 张爱英 | Method and system for enhancing features by aid of bidirectional long-term and short-term memory recurrent neural networks |
Non-Patent Citations (3)
Title |
---|
ALEX GRAVES ET AL.: "Speech recognition with deep recurrent neural networks", 《2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS》 * |
JASON P.C. CHIU ET AL.: "Named Entity Recognition with Bidirectional LSTM-CNNs", 《ARXIV:1511.08308V1》 * |
- HU XINCHEN: "Research on Semantic Relation Classification Based on LSTM", 《China Master's Theses Full-text Database, Information Science and Technology》 *
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2019531562A (en) * | 2017-02-23 | 2019-10-31 | ▲騰▼▲訊▼科技(深▲セン▼)有限公司 | Keyword extraction method, computer apparatus, and storage medium |
CN107507052B (en) * | 2017-07-17 | 2021-04-09 | 苏州凯联信息科技有限公司 | Quotation information acquisition method and device |
CN107507052A (en) * | 2017-07-17 | 2017-12-22 | 苏州凯联信息科技有限公司 | A kind of quotation information acquisition methods and device |
WO2019041529A1 (en) * | 2017-08-31 | 2019-03-07 | 平安科技(深圳)有限公司 | Method, electronic apparatus, and computer readable storage medium for identifying company as subject of news report |
CN110019711A (en) * | 2017-11-27 | 2019-07-16 | 吴谨准 | A kind of control method and device of pair of medicine text data structureization processing |
CN108090045A (en) * | 2017-12-20 | 2018-05-29 | 珠海市君天电子科技有限公司 | A kind of method for building up of marking model, segmenting method and device |
CN108090045B (en) * | 2017-12-20 | 2021-04-30 | 珠海市君天电子科技有限公司 | Word segmentation method and device and readable storage medium |
CN108182976A (en) * | 2017-12-28 | 2018-06-19 | 西安交通大学 | A kind of clinical medicine information extracting method based on neural network |
CN108154191A (en) * | 2018-01-12 | 2018-06-12 | 北京经舆典网络科技有限公司 | The recognition methods of file and picture and system |
CN108154191B (en) * | 2018-01-12 | 2021-08-10 | 北京经舆典网络科技有限公司 | Document image recognition method and system |
CN108664474A (en) * | 2018-05-21 | 2018-10-16 | 众安信息技术服务有限公司 | A kind of resume analytic method based on deep learning |
CN110020190A (en) * | 2018-07-05 | 2019-07-16 | 中国科学院信息工程研究所 | A kind of suspected threat index verification method and system based on multi-instance learning |
CN109117795A (en) * | 2018-08-17 | 2019-01-01 | 西南大学 | Neural network expression recognition method based on graph structure |
CN109117795B (en) * | 2018-08-17 | 2022-03-25 | 西南大学 | Neural network expression recognition method based on graph structure |
CN109800332A (en) * | 2018-12-04 | 2019-05-24 | 北京明略软件系统有限公司 | Method, apparatus, computer storage medium and the terminal of processing field name |
CN111428511A (en) * | 2020-03-12 | 2020-07-17 | 北京明略软件系统有限公司 | Event detection method and device |
CN111428511B (en) * | 2020-03-12 | 2023-05-26 | 北京明略软件系统有限公司 | Event detection method and device |
CN111696674A (en) * | 2020-06-12 | 2020-09-22 | 电子科技大学 | Deep learning method and system for electronic medical record |
CN111696674B (en) * | 2020-06-12 | 2023-09-08 | 电子科技大学 | Deep learning method and system for electronic medical records |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105955952A (en) | Information extraction method based on bidirectional recurrent neural network | |
CN105975555A (en) | Enterprise abbreviation extraction method based on bidirectional recurrent neural network | |
CN108763326B (en) | Emotion analysis model construction method of convolutional neural network based on feature diversification | |
CN105976056A (en) | Information extraction system based on bidirectional RNN | |
Du et al. | Explicit interaction model towards text classification | |
CN110110335B (en) | Named entity identification method based on stack model | |
CN107145483A (en) | A kind of adaptive Chinese word cutting method based on embedded expression | |
CN107122416A (en) | A kind of Chinese event abstracting method | |
CN105975987A (en) | Enterprise industry classification method based on full-automatic learning | |
CN107145484A (en) | A kind of Chinese word cutting method based on hidden many granularity local features | |
CN106294322A (en) | A kind of Chinese based on LSTM zero reference resolution method | |
CN106682089A (en) | RNNs-based method for automatic safety checking of short message | |
CN111353042A (en) | Fine-grained text viewpoint analysis method based on deep multi-task learning | |
CN107590127A (en) | A kind of exam pool knowledge point automatic marking method and system | |
CN106844349A (en) | Comment spam recognition methods based on coorinated training | |
CN105975457A (en) | Information classification prediction system based on full-automatic learning | |
US20220309254A1 (en) | Open information extraction from low resource languages | |
CN111475615A (en) | Fine-grained emotion prediction method, device and system for emotion enhancement and storage medium | |
CN111914553B (en) | Financial information negative main body judging method based on machine learning | |
CN111967265B (en) | Chinese word segmentation and entity recognition combined learning method for automatic generation of data set | |
Joshua Thomas et al. | A deep learning framework on generation of image descriptions with bidirectional recurrent neural networks | |
CN111831783A (en) | Chapter-level relation extraction method | |
CN105975456A (en) | Enterprise entity name analysis and identification system | |
CN110941700B (en) | Multi-task joint learning-based argument mining system and working method thereof | |
Paul et al. | Detecting hate speech using deep learning techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160921 |