CN106570148A - Convolutional neural network-based attribute extraction method - Google Patents

Convolutional neural network-based attribute extraction method

Info

Publication number
CN106570148A
CN106570148A (application CN201610968810.5A)
Authority
CN
China
Prior art keywords
vector
word
sentence
sequence
convolution
Prior art date
Legal status
Granted
Application number
CN201610968810.5A
Other languages
Chinese (zh)
Other versions
CN106570148B (en)
Inventor
汤斯亮
吴飞
张金剑
蒋焕剑
庄越挺
鲁伟明
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201610968810.5A priority Critical patent/CN106570148B/en
Publication of CN106570148A publication Critical patent/CN106570148A/en
Application granted granted Critical
Publication of CN106570148B publication Critical patent/CN106570148B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The invention discloses a convolutional neural network-based attribute extraction method. The method comprises the following steps: (1) constructing an external knowledge base; (2) obtaining text data; (3) obtaining attribute-containing sentences by using a distant supervision method; (4) vectorizing the sentences by using word vectors; and (5) inputting the sentences into a convolutional neural network for training and classification. According to the method, attribute-containing candidate sentences are extracted from an unstructured text data set based on a manually defined mapping, by combining the external knowledge base with distant supervision, and the sentences are then classified with the convolutional neural network model, thereby completing the attribute extraction task.

Description

An attribute extraction method based on convolutional neural networks
Technical field
The present invention relates to text feature extraction and attribute extraction, and more particularly to an attribute extraction method based on convolutional neural networks.
Background technology
The world today is in an era of information explosion; the popularity and rapid development of the Internet have generated massive information resources. These resources are of great significance for the development of science and technology: academia needs to extract basic research material from them, and industry needs to mine potential business opportunities from them. How to make good use of these Internet information resources has therefore become one of the main directions of technological research in recent years.
Although the quantity of information resources on the Internet is huge, these resources often lack structure. Structured data refers to row data that can be expressed with a two-dimensional table structure, whereas the fields of unstructured data are of variable length and cannot conveniently be expressed with a two-dimensional logical table. Because many of these resources are unstructured or semi-structured data, quickly and efficiently searching and understanding them is greatly limited.
Text data is an important part of Internet information resources, and most text data on the Internet is also unstructured, such as news, blogs, e-mail, government documents, chat records and system logs. In order to use these unstructured text data efficiently, information extraction (Information Extraction) technology arose: it automatically converts the unstructured or semi-structured text of an input page into structured data. An information extraction task is defined by its input and its extraction target. The input can be an unstructured document written in natural language, or a semi-structured document on a web page; the extraction target is a relation of k-tuples (where k is the number of attributes of a record) or a complex hierarchical data object.
Traditional attribute extraction techniques have many drawbacks. First, whether rule-based or based on classification algorithms, they all require considerable manual intervention, such as rule design for rule-based methods and data annotation and feature design for classification-based methods; the cost of manual intervention is high, and annotation by professionals is needed to obtain authoritative data. At the same time, manual work also introduces errors, which keep accumulating in subsequent algorithms and ultimately lead to excessive deviation of the results. Second, the training data sets of such algorithms are limited to certain domains, i.e., they lack generality; for example, an attribute extraction classifier trained on sports news cannot be used well on other news. In general, the results obtained by the above algorithms are also not ideal, because the manually designed rules of rule-based methods are limited, the annotated data of classification-based methods are also limited, and those methods rely heavily on the quality of manually designed features.
Summary of the invention
The purpose of the present invention is to overcome the deficiencies of the prior art and to provide an attribute extraction method based on convolutional neural networks.
The attribute extraction method based on convolutional neural networks comprises the following steps:
1) obtain the infobox data of Wikipedia to build an external knowledge base;
2) obtain forum and news data to build a text corpus;
3) search the text corpus using the distant supervision method to obtain sentences containing attributes;
4) based on word vectors, first describe each word as a vector, then assemble the word vectors in a sentence to obtain a vectorized description of the sentence;
5) input the sentences into a convolutional neural network: first apply convolution to the sentence, then apply max pooling to the convolution results, and finally input the result into a softmax function to obtain the classification results;
6) map the classification results to attribute values.
Each step of the present invention can use the following preferred implementation:
The described step 4) is specifically as follows:
4.1) obtain a word vector model G trained on news data, stored in sequence form:
[word1 vector1 word2 vector2...wordN vectorN]
where N is the number of words in the word vector model, wordN is the N-th word, and vectorN is the N-th word vector;
4.2) read the file and convert the sequence into a mapping format, mapping each word to its corresponding vector:
{word1:vector1,word2:vector2...wordN:vectorN}
4.3) read the sentence and perform word segmentation, converting the sentence into a sequence of words:
[wordi1,wordi2...wordil]
where i denotes the i-th sentence, l denotes the number of words in the sentence, and wordil denotes the l-th word of the i-th sentence;
4.4) read the sequence and query the mapping, converting the words in the sequence into word vectors:
[vectori1,vectori2...vectoril]
where vectoril denotes the l-th word vector of the i-th sentence.
The described step 5) is specifically as follows:
5.1) convert the word vectors in the sequence into column vectors:
X=[x1,x2,...,xl]
where xl denotes the column vector of the word vector of the l-th word in the sentence;
5.2) apply the convolution operation to the sentence to obtain the convolved sequence:
S=[s1,s2,...,sl-ω+1]
where the window size of the convolutional layer is ω, a convolution kernel is f=[f1,f2,...,fω], and fi is a column vector of the same size as a word vector;
5.3) pass the convolved sequence through the ReLU activation function to obtain the activated convolution result sequence si:
si = g(∑_{j=0}^{ω-1} f_{j+1}^T x_{i+j} + b)
where b is the bias term and g is the ReLU activation function.
5.4) repeat 5.1)–5.3), computing with different convolution kernels and different window sizes; in the end each convolution kernel yields one convolution result sequence, an abstract representation of the original word vectors and position vectors.
5.5) for the result sequence si obtained by each convolution kernel f, perform pooling with the max function, selecting the maximum over all values of the result sequence as the new result feature pf:
pf=max { s }=max { s1,s2,...,sl-ω+1}
The same pooling process is applied to every convolution kernel, yielding a feature vector whose length equals the number of convolution kernels; this feature vector is taken as the abstract sentence feature produced by the whole convolutional neural network;
5.6) input the pooling result into the softmax function to obtain the classification output output(xj) for class j:
output(xj) = e^{pj} / ∑_{k=1}^{K} e^{pk}, j = 1, ..., K
where K denotes the number of classes.
In view of the various drawbacks of conventional methods, the present invention proposes a multi-instance multi-label attribute extraction algorithm based on convolutional neural networks. It uses the distant supervision method to exploit an existing knowledge base and automatically generate training data, and uses several optimization methods to clean the training data, thereby saving the tedious work of manual annotation. The present invention also uses a convolutional neural network to extract the features of text sentences automatically, which not only saves manual work but also extracts more abstract and more expressive features. Finally, a multi-instance multi-label model is used to handle the fact that multiple relations may exist between two entities. In terms of effect, the method is superior to traditional attribute extraction algorithms and to several mainstream algorithms of recent years.
Description of the drawings
Fig. 1 is a schematic representation of the core convolutional neural network model used herein. The left side of the figure shows two positive samples and one negative sample respectively; both positive and negative samples are extracted with the distant supervision method. According to the mapping relation between words and vectors, the vector representation of each sentence is extracted; after the convolution and pooling operations, the result is input into the softmax function.
Fig. 2 shows text fragments from a medical data set describing drugs and their side effects, namely candidate sentences carrying the positive class label and the negative class label for the drug side-effect attribute.
Specific embodiment
The present invention is further elaborated below with reference to the accompanying drawings and specific embodiments.
The attribute extraction method based on convolutional neural networks comprises the following steps:
1) Obtain the infobox data of Wikipedia to build the external knowledge base. The concrete steps are as follows:
1.1. Download the public data dump of Wikipedia.
1.2. Extract the infobox of each entry in Wikipedia, map the infobox field names to attribute names, and store the attribute values and entry names.
1.3. Across all entries and all infoboxes, attribute values and entry names sharing the same attribute name are stored together (a minimal sketch is given below).
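The following is a minimal Python sketch of steps 1.2–1.3, assuming the infoboxes have already been parsed into (entry_name, field_name, field_value) triples; the FIELD_TO_ATTRIBUTE dictionary and its two example entries are hypothetical placeholders for the manually defined mapping.

from collections import defaultdict

# Hypothetical, manually defined mapping from infobox field names to attribute names.
FIELD_TO_ATTRIBUTE = {
    "birth_date": "date of birth",
    "spouse": "spouse",
}

def build_knowledge_base(infobox_triples):
    """Steps 1.2-1.3: group (entry name, attribute value) pairs by attribute name.

    infobox_triples: iterable of (entry_name, field_name, field_value) tuples
    returns: dict mapping attribute name -> list of (entry_name, field_value) pairs
    """
    knowledge_base = defaultdict(list)
    for entry_name, field_name, field_value in infobox_triples:
        attribute = FIELD_TO_ATTRIBUTE.get(field_name)
        if attribute is not None:
            knowledge_base[attribute].append((entry_name, field_value))
    return knowledge_base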
2) Obtain forum and news data to build the text corpus. The concrete steps are as follows:
2.1. Download news data, such as the public data of the New York Times.
2.2. Preprocess the text data: remove HTML, XML and similar tags, convert the character encoding to UTF-8, and convert the format to plain text.
2.3. Use a natural language processing tool, such as Stanford CoreNLP, to perform word segmentation on the plain text and extract named entity information.
In the present invention, the public Wikipedia data and the news data are taken directly from the TAC-KBP 2015 data set (a preprocessing sketch follows).
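A minimal sketch of the preprocessing in steps 2.2–2.3, assuming the raw pages are available as byte strings; the whitespace tokenizer is only a stand-in for the word segmentation and named entity recognition that Stanford CoreNLP would perform.

import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def to_plain_text(raw_bytes, source_encoding="latin-1"):
    """Step 2.2: decode the page, strip HTML/XML tags, and return plain text
    that can be written back out as UTF-8."""
    text = raw_bytes.decode(source_encoding, errors="replace")
    text = html.unescape(text)           # resolve entities such as &amp;
    text = TAG_RE.sub(" ", text)         # drop HTML/XML tags
    return re.sub(r"\s+", " ", text).strip()

def segment(sentence):
    """Stand-in for step 2.3: in practice Stanford CoreNLP would segment the
    text and extract named entities; here we only split on whitespace."""
    return sentence.split()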
3) Search the text corpus using the distant supervision method to obtain sentences containing attributes. The concrete steps are as follows:
3.1. Build positive samples. Under the same attribute name, if the entity name and the attribute value both occur in a sentence, that sentence is labeled as a positive sample.
3.2. Build negative samples. Under the same attribute name, if the entity name occurs in a sentence, the attribute value does not occur in the sentence, but the sentence contains named entity information of the attribute value's type, then the sentence is labeled as a negative sample.
3.3. Randomly sample the negative samples so that the number of negative samples is roughly equal to the number of positive samples (see the sketch below).
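A minimal sketch of the distant supervision labeling in steps 3.1–3.3 for a single (entity, attribute value) fact taken from the knowledge base; the value_type_mentions argument is a hypothetical helper holding the sentence mentions whose named entity type matches the attribute value's type.

import random

def label_sentences(sentences, entity, value, value_type_mentions):
    """Steps 3.1-3.3: label candidate sentences as positive or negative.

    sentences: list of (sentence_text, named_entity_mentions) pairs
    value_type_mentions: set of mentions whose NER type matches the value's type
    """
    positives, negatives = [], []
    for text, mentions in sentences:
        if entity in text and value in text:
            positives.append(text)                      # step 3.1: positive sample
        elif entity in text and value not in text and mentions & value_type_mentions:
            negatives.append(text)                      # step 3.2: negative sample
    # Step 3.3: randomly downsample negatives to roughly the number of positives.
    if len(negatives) > len(positives):
        negatives = random.sample(negatives, len(positives))
    return positives, negatives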
4) Based on word vectors, first describe each word as a vector, then assemble the word vectors in the sentence to obtain a vectorized description of the sentence. The concrete steps are as follows:
4.1) Obtain the word vector model G released by Google and trained on news data, stored in sequence form:
[word1 vector1 word2 vector2...wordN vectorN]
where N is the number of words in the word vector model, wordN is the N-th word, and vectorN is the N-th word vector;
4.2) Read the file and convert the sequence into a mapping format, mapping each word to its corresponding vector:
{word1:vector1,word2:vector2...wordN:vectorN}
4.3) Read the sentence and perform word segmentation, converting the sentence into a sequence of words:
[wordi1,wordi2...wordil]
where i denotes the i-th sentence, l denotes the number of words in the sentence, and wordil denotes the l-th word of the i-th sentence;
4.4) Read the sequence and query the mapping, converting the words in the sequence into word vectors:
[vectori1,vectori2...vectoril]
where vectoril denotes the l-th word vector of the i-th sentence (a minimal sketch of steps 4.1)–4.4) follows).
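A minimal sketch of steps 4.1)–4.4), assuming the word vectors are available as a plain text file with one "word v1 v2 ... vd" line per word (the released Google model is actually binary and would need its own loader); mapping unknown words to a zero vector is a simplifying assumption.

import numpy as np

def load_word_vectors(path):
    """Steps 4.1-4.2: read the sequence-form file into a word -> vector mapping."""
    word_to_vec = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split()
            word_to_vec[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return word_to_vec

def vectorize_sentence(words, word_to_vec, dim=300):
    """Steps 4.3-4.4: map each word of a segmented sentence to its word vector."""
    return [word_to_vec.get(w, np.zeros(dim, dtype=np.float32)) for w in words]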
5) Input the sentences into the convolutional neural network: first apply convolution to the sentence, then apply max pooling to the convolution results, and finally input the result into the softmax function to obtain the classification results. The concrete steps are as follows:
5.1) Convert the word vectors in the sequence into column vectors:
X=[x1,x2,...,xl]
where xl denotes the column vector of the word vector of the l-th word in the sentence;
5.2) Apply the convolution operation to the sentence to obtain the convolved sequence:
S=[s1,s2,...,sl-ω+1]
where the window size of the convolutional layer is ω, a convolution kernel is f=[f1,f2,...,fω], and fi is a column vector of the same size as a word vector;
5.3) Pass the convolved sequence through the ReLU activation function to obtain the activated convolution result sequence si:
si = g(∑_{j=0}^{ω-1} f_{j+1}^T x_{i+j} + b)
where b is the bias term and g is the ReLU activation function.
5.4) Repeat 5.1)–5.3), computing with different convolution kernels and different window sizes; in the end each convolution kernel yields one convolution result sequence, an abstract representation of the original word vectors and position vectors.
5.5) For the result sequence si obtained by each convolution kernel f, perform pooling with the max function, selecting the maximum over all values of the result sequence as the new result feature pf:
pf=max { s }=max { s1,s2,...,sl-ω+1}
The same pooling process is applied to every convolution kernel, yielding a feature vector whose length equals the number of convolution kernels; this feature vector is taken as the abstract sentence feature produced by the whole convolutional neural network;
5.6) Input the pooling result into the softmax function to obtain the classification output output(xj) for class j:
output(xj) = e^{pj} / ∑_{k=1}^{K} e^{pk}, j = 1, ..., K
where K denotes the number of classes (a minimal NumPy sketch of steps 5.1)–5.6) follows).
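A minimal NumPy sketch of steps 5.1)–5.6) on a toy sentence; the random kernels and the fully connected projection W onto K classes are added assumptions (the formula above feeds the pooled features to softmax directly), and in practice the parameters would be learned by training.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv_feature(X, f, b):
    """Steps 5.2-5.3: slide kernel f (shape [omega, d]) over the sentence matrix
    X (shape [l, d]) and apply ReLU: s_i = g(sum_j f_{j+1}^T x_{i+j} + b)."""
    l, omega = X.shape[0], f.shape[0]
    return np.array([relu(np.sum(f * X[i:i + omega]) + b) for i in range(l - omega + 1)])

def softmax(p):
    e = np.exp(p - p.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(7, 50))           # step 5.1: l=7 words, 50-dimensional word vectors

# Steps 5.4-5.5: kernels with window sizes 3, 4 and 5, each max-pooled to one feature.
kernels = [(rng.normal(size=(w, 50)), 0.0) for w in (3, 4, 5)]
p = np.array([conv_feature(X, f, b).max() for f, b in kernels])

# Step 5.6: softmax over K classes (here via an assumed projection W).
K = 4
W = rng.normal(size=(K, len(p)))
print(softmax(W @ p))                  # output(x_j), j = 1..K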
6) Map the classification results to attribute values: according to the sentence classification obtained, retrieve the corresponding entity and attribute-value pair to obtain the entity's attribute value.
Embodiment 1
Using the above method, this embodiment performs attribute extraction on a piece of news text submitted by a user; the specific parameters and procedures of each implementation step are as follows:
1. For the input sentences, search for those containing an entity, and compose the sentences containing an entity into the original sentence set:
{sentence1,sentence2,...sentenceN}
2. In the original sentence set, search for sentences containing an attribute value, and compose the sentences containing an attribute value into the candidate sentence set (as shown in Fig. 2); each sentence in the candidate set contains both an entity and an attribute value:
{candidate1,candidate2,...candidateN1}
3. Record the entity and attribute-value pair of each candidate sentence in the candidate set, stored in sequence form and corresponding to the order of the sentences in the candidate set:
{(entity1,slot filler 1),(entity2,slot filler 2),...(entityN1,slot filler N1)}
4. Download Google's word vector model and read the data in sequence form:
[word1 vector1 word2 vector2...wordN vectorN]
5. Convert the sequence into a mapping format, mapping each word to its corresponding vector:
{word1:vector1,word2:vector2...wordN:vectorN}
6. Perform word segmentation on each sentence in the candidate set and convert it into a sequence of words:
[wordi1,wordi2...wordis]
7. For each sentence in the candidate set, read the mapping built from Google's word vector model to obtain the vectorized representation of the sentence:
[vectori1,vectori2...vectoris]
8. For the vectorized representation of the sentence (as shown in Fig. 1), convert each word vector into a column vector; the sentence then becomes a two-dimensional matrix:
X=[x1,x2,...,xl]
9. Apply a convolution operation with a kernel window size of 3 to the sentence, adding a bias term.
10. Input the convolution output into the activation function, using the ReLU activation function.
11. Repeat operations 9 and 10 with kernel window sizes of 4 and 5, and concatenate the outputs:
S=[s1,s2,...,sl-ω+1,s21,s22,...,sl2-ω2+1,s31,s32,...,sl3-ω3+1]
12. Enter the pooling layer and apply max pooling to the output of the convolutional layer:
pf=max { s }=max { s1,s2,...,sl-ω+1}
13. As shown in Fig. 2, input the output of the pooling layer into the softmax function to obtain the classification results.
15. According to the output result and the sentence order, find the corresponding entity-attribute pair; this is the result of the attribute extraction (a sketch of this final mapping follows):
(entity1,slot filler 1)
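A minimal sketch of this final step, assuming the classifier outputs one label per candidate sentence in the same order as the stored (entity, slot filler) pairs; encoding the positive class as 1 is an assumption.

def collect_attribute_values(labels, entity_value_pairs, positive_label=1):
    """Align per-sentence classification labels with the stored entity/value
    pairs by sentence order and keep the pairs classified as positive."""
    return [pair for label, pair in zip(labels, entity_value_pairs)
            if label == positive_label]

# Example: sentences 1 and 3 were classified as expressing the attribute.
pairs = [("entity1", "slot filler 1"), ("entity2", "slot filler 2"),
         ("entity3", "slot filler 3")]
print(collect_attribute_values([1, 0, 1], pairs))
# -> [('entity1', 'slot filler 1'), ('entity3', 'slot filler 3')]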
As shown in Table 1, comparison results on the TAC-KBP 2015 data set between the method described in the present invention and pre-existing mainstream methods show that the present invention has a clear advantage on the Precision, Recall and F1-Score evaluation criteria.
Table 1
Model Precision Recall F1-Score
LR-SF 0.4483 0.3652 0.4025
MIML-SF 0.5412 0.3893 0.4529
CNN-SF 0.5657 0.4067 0.4732
Model of the present invention 0.6343 0.4136 0.5007

Claims (3)

1. An attribute extraction method based on convolutional neural networks, characterized by comprising the following steps:
1) obtain the infobox data of Wikipedia to build an external knowledge base;
2) obtain forum and news data to build a text corpus;
3) search the text corpus using the distant supervision method to obtain sentences containing attributes;
4) based on word vectors, first describe each word as a vector, then assemble the word vectors in the sentence to obtain a vectorized description of the sentence;
5) input the sentences into a convolutional neural network: first apply convolution to the sentence, then apply max pooling to the convolution results, and finally input the result into a softmax function to obtain the classification results;
6) map the classification results to attribute values.
2. The attribute extraction method based on convolutional neural networks according to claim 1, characterized in that said step 4) is specifically as follows:
4.1) obtain a word vector model G trained on news data, stored in sequence form:
[word1 vector1 word2 vector2 ... wordN vectorN]
where N is the number of words in the word vector model, wordN is the N-th word, and vectorN is the N-th word vector;
4.2) read the file and convert the sequence into a mapping format, mapping each word to its corresponding vector:
{word1:vector1,word2:vector2...wordN:vectorN}
4.3) read the sentence and perform word segmentation, converting the sentence into a sequence of words:
[wordi1,wordi2...wordil]
where i denotes the i-th sentence, l denotes the number of words in the sentence, and wordil denotes the l-th word of the i-th sentence;
4.4) read the sequence and query the mapping, converting the words in the sequence into word vectors:
[vectori1,vectori2...vectoril]
where vectoril denotes the l-th word vector of the i-th sentence.
3. The attribute extraction method based on convolutional neural networks according to claim 1, characterized in that said step 5) is specifically as follows:
5.1) convert the word vectors in the sequence into column vectors:
X=[x1,x2,...,xl]
where xl denotes the column vector of the word vector of the l-th word in the sentence;
5.2) apply the convolution operation to the sentence to obtain the convolved sequence:
S=[s1,s2,...,sl-ω+1]
where the window size of the convolutional layer is ω, a convolution kernel is f=[f1,f2,...,fω], and fi is a column vector of the same size as a word vector;
5.3) pass the convolved sequence through the ReLU activation function to obtain the activated convolution result sequence si:
si = g(∑_{j=0}^{ω-1} f_{j+1}^T x_{i+j} + b)
where b is the bias term and g is the ReLU activation function.
5.4) repeat 5.1)–5.3), computing with different convolution kernels and different window sizes; in the end each convolution kernel yields one convolution result sequence, an abstract representation of the original word vectors and position vectors.
5.5) for the result sequence si obtained by each convolution kernel f, perform pooling with the max function, selecting the maximum over all values of the result sequence as the new result feature pf:
pf=max { s }=max { s1,s2,...,sl-ω+1}
The same pooling process is applied to every convolution kernel, yielding a feature vector whose length equals the number of convolution kernels; this feature vector is taken as the abstract sentence feature produced by the whole convolutional neural network;
5.6) input the pooling result into the softmax function to obtain the classification output output(xj) for class j:
output(xj) = e^{pj} / ∑_{k=1}^{K} e^{pk}, j = 1, ..., K
where K denotes the number of classes.
CN201610968810.5A 2016-10-27 2016-10-27 A kind of attribute extraction method based on convolutional neural networks Active CN106570148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610968810.5A CN106570148B (en) 2016-10-27 2016-10-27 A kind of attribute extraction method based on convolutional neural networks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610968810.5A CN106570148B (en) 2016-10-27 2016-10-27 A kind of attribute extraction method based on convolutional neural networks

Publications (2)

Publication Number Publication Date
CN106570148A true CN106570148A (en) 2017-04-19
CN106570148B CN106570148B (en) 2019-07-23

Family

ID=58541372

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610968810.5A Active CN106570148B (en) 2016-10-27 2016-10-27 A kind of attribute extraction method based on convolutional neural networks

Country Status (1)

Country Link
CN (1) CN106570148B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016054301A1 (en) * 2014-10-02 2016-04-07 Microsoft Technology Licensing, Llc Distant supervision relationship extractor
CN106055675A (en) * 2016-06-06 2016-10-26 杭州量知数据科技有限公司 Relation extracting method based on convolution neural network and distance supervision

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHEN WANG et al.: "Distantly Supervised Neural Network Model", 14th China National Conference, CCL 2015 and Third International Symposium, NLP-NABD 2015 *

Cited By (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107180247A (en) * 2017-05-19 2017-09-19 中国人民解放军国防科学技术大学 Relation grader and its method based on selective attention convolutional neural networks
CN107220237A (en) * 2017-05-24 2017-09-29 南京大学 A kind of method of business entity's Relation extraction based on convolutional neural networks
CN107301167A (en) * 2017-05-25 2017-10-27 中国科学院信息工程研究所 A kind of work(performance description information recognition methods and device
CN108108354A (en) * 2017-06-18 2018-06-01 北京理工大学 A kind of microblog users gender prediction's method based on deep learning
CN108108354B (en) * 2017-06-18 2021-04-06 北京理工大学 Microblog user gender prediction method based on deep learning
CN107301170A (en) * 2017-06-19 2017-10-27 北京百度网讯科技有限公司 The method and apparatus of cutting sentence based on artificial intelligence
US10755048B2 (en) 2017-06-19 2020-08-25 Beijing Baidu Netcom Science And Technology Co., Ltd. Artificial intelligence based method and apparatus for segmenting sentence
CN107506350A (en) * 2017-08-16 2017-12-22 京东方科技集团股份有限公司 A kind of method and apparatus of identification information
US10747961B2 (en) 2017-08-16 2020-08-18 Boe Technology Group Co., Ltd. Method and device for identifying a sentence
CN108038543A (en) * 2017-10-24 2018-05-15 华南师范大学 It is expected and anti-desired depth learning method and nerve network system
CN108038543B (en) * 2017-10-24 2021-01-22 华南师范大学 Expectation and anti-expectation deep learning method and neural network system
CN107590133A (en) * 2017-10-24 2018-01-16 武汉理工大学 The method and system that position vacant based on semanteme matches with job seeker resume
CN107943967B (en) * 2017-11-28 2020-05-22 华南理工大学 Text classification algorithm based on multi-angle convolutional neural network and cyclic neural network
CN107943967A (en) * 2017-11-28 2018-04-20 华南理工大学 Algorithm of documents categorization based on multi-angle convolutional neural networks and Recognition with Recurrent Neural Network
CN108154234A (en) * 2017-12-04 2018-06-12 盈盛资讯科技有限公司 A kind of knowledge learning method and system based on template
CN108280055A (en) * 2017-12-04 2018-07-13 盈盛资讯科技有限公司 A kind of knowledge learning method and system based on binary crelation
CN108009284A (en) * 2017-12-22 2018-05-08 重庆邮电大学 Using the Law Text sorting technique of semi-supervised convolutional neural networks
CN108256583A (en) * 2018-01-25 2018-07-06 北京东方科诺科技发展有限公司 A kind of multi-tag classification learning method based on coupling learning
CN108304530B (en) * 2018-01-26 2022-03-18 腾讯科技(深圳)有限公司 Knowledge base entry classification method and device and model training method and device
CN108304530A (en) * 2018-01-26 2018-07-20 腾讯科技(深圳)有限公司 Knowledge base entry sorting technique and device, model training method and device
CN108287911B (en) * 2018-02-01 2020-04-24 浙江大学 Relation extraction method based on constrained remote supervision
CN108287911A (en) * 2018-02-01 2018-07-17 浙江大学 A kind of Relation extraction method based on about fasciculation remote supervisory
CN108520042A (en) * 2018-04-03 2018-09-11 公安部第三研究所 Realize the system and method through detecing case-involving the role's calibration and role's assessment of suspect in work
CN108520042B (en) * 2018-04-03 2022-02-08 公安部第三研究所 System and method for realizing suspect case-involved role calibration and role evaluation in detection work
CN108763353B (en) * 2018-05-14 2022-03-15 中山大学 Baidu encyclopedia relation triple extraction method based on rules and remote supervision
CN108763353A (en) * 2018-05-14 2018-11-06 中山大学 Rule-based and remote supervisory Baidupedia relationship triple abstracting method
CN108664474A (en) * 2018-05-21 2018-10-16 众安信息技术服务有限公司 A kind of resume analytic method based on deep learning
CN109063759A (en) * 2018-07-20 2018-12-21 浙江大学 A kind of neural network structure searching method applied to the more attribute forecasts of picture
CN109446299B (en) * 2018-08-27 2022-08-16 中国科学院信息工程研究所 Method and system for searching e-mail content based on event recognition
CN109446299A (en) * 2018-08-27 2019-03-08 中国科学院信息工程研究所 The method and system of searching email content based on event recognition
CN109308304A (en) * 2018-09-18 2019-02-05 深圳和而泰数据资源与云技术有限公司 Information extraction method and device
CN109657207A (en) * 2018-11-29 2019-04-19 爱保科技(横琴)有限公司 The formatting processing method and processing unit of clause
CN109657207B (en) * 2018-11-29 2023-11-03 爱保科技有限公司 Formatting processing method and processing device for clauses
CN109670542A (en) * 2018-12-11 2019-04-23 田刚 A kind of false comment detection method based on comment external information
CN109885677A (en) * 2018-12-26 2019-06-14 中译语通科技股份有限公司 A kind of multi-faceted big data acquisition clearing system and method
CN109815338A (en) * 2018-12-28 2019-05-28 北京市遥感信息研究所 Relation extraction method and system in knowledge mapping based on mixed Gauss model
CN109492230B (en) * 2019-01-11 2022-12-20 浙江大学城市学院 Method for extracting insurance contract key information based on interested text field convolutional neural network
CN109492230A (en) * 2019-01-11 2019-03-19 浙江大学城市学院 A method of insurance contract key message is extracted based on textview field convolutional neural networks interested
CN110516239A (en) * 2019-08-26 2019-11-29 贵州大学 A kind of segmentation pond Relation extraction method based on convolutional neural networks
CN112445955A (en) * 2019-08-30 2021-03-05 珠海格力电器股份有限公司 Business opportunity information management method, system and storage medium
CN112445955B (en) * 2019-08-30 2023-10-13 珠海格力电器股份有限公司 Business opportunity information management method, system and storage medium
CN110717047A (en) * 2019-10-22 2020-01-21 湖南科技大学 Web service classification method based on graph convolution neural network
CN110717047B (en) * 2019-10-22 2022-06-28 湖南科技大学 Web service classification method based on graph convolution neural network
CN111078895A (en) * 2019-12-18 2020-04-28 江南大学 Remote supervision entity relation extraction method based on denoising convolutional neural network
CN111078895B (en) * 2019-12-18 2023-04-18 江南大学 Remote supervision entity relation extraction method based on denoising convolutional neural network
CN113535820A (en) * 2021-07-20 2021-10-22 贵州电网有限责任公司 Electrical operating personnel attribute presumption method based on convolutional neural network

Also Published As

Publication number Publication date
CN106570148B (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN106570148B (en) A kind of attribute extraction method based on convolutional neural networks
CN108614875B (en) Chinese emotion tendency classification method based on global average pooling convolutional neural network
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN107463607B (en) Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN108038205B (en) Viewpoint analysis prototype system for Chinese microblogs
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN106446526A (en) Electronic medical record entity relation extraction method and apparatus
CN110765260A (en) Information recommendation method based on convolutional neural network and joint attention mechanism
CN107943784A (en) Relation extraction method based on generation confrontation network
CN113239210B (en) Water conservancy literature recommendation method and system based on automatic completion knowledge graph
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN106970912A (en) Chinese sentence similarity calculating method, computing device and computer-readable storage medium
CN105138665A (en) Online internet topic mining method based on improved LDA model
CN104484380A (en) Personalized search method and personalized search device
CN106372064A (en) Characteristic word weight calculating method for text mining
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN104699797A (en) Webpage data structured analytic method and device
CN104133855A (en) Smart association method and device for input method
CN107247739A (en) A kind of financial publication text knowledge extracting method based on factor graph
CN105608075A (en) Related knowledge point acquisition method and system
CN109145288A (en) Based on variation from the text depth characteristic extracting method of encoding model
CN113312480A (en) Scientific and technological thesis level multi-label classification method and device based on graph convolution network
CN105956158A (en) Automatic extraction method of network neologism on the basis of mass microblog texts and use information
CN111814476A (en) Method and device for extracting entity relationship

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant