CN107145503A

CN107145503A - Distantly supervised non-taxonomic relation extraction method and system based on word2vec

Info

Publication number: CN107145503A
Application number: CN201710166727.0A
Authority: CN
Inventors: 赵明; 杜会芳; 董翠翠; 陈瑛
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2017-03-20
Filing date: 2017-03-20
Publication date: 2017-09-08

Abstract

The invention discloses a distantly supervised non-taxonomic relation extraction method and system based on word2vec that can extract non-taxonomic relations in the vegetable domain more accurately. The method includes: crawling unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, and preprocessing the corpus to obtain a preliminary training corpus; training a word2vec model on the preliminary training corpus and using the model to obtain a vector for each sentence; aggregating the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extracting a common sentence pattern and an uncommon sentence pattern; selecting two sentence vectors, one matching each of the two patterns, as the initial centers for k-means clustering, clustering all sentence vectors, and keeping the cluster that matches the common sentence pattern to obtain a higher-quality training corpus; and training a convolutional neural network model on the higher-quality training corpus and extracting non-taxonomic relations through a fully connected softmax layer.

Description

Distantly supervised non-taxonomic relation extraction method and system based on word2vec

Technical field

The present invention relates to the field of weakly supervised classification, and in particular to a distantly supervised non-taxonomic relation extraction method and system based on word2vec.

Background art

Research on ontology-style knowledge graphs for agriculture is still at an early stage, and there are few published reports on non-taxonomic relations (relations other than hypernym-hyponym taxonomic relations). Some work on ancient agronomy and tea science has touched on learning non-taxonomic relations, for example He Lin's 《Research on Semi-automatic Construction and Retrieval of Domain Ontologies》 and Xu Jicheng's 《Research on Ontology Learning and Modeling for the Vegetable Domain》, but these works all use basic association-rule methods to find concept pairs between which a relation exists. The relation types they extract are limited, and their corpora come mostly from books and papers rather than from the large data resources available on the Web. The accuracy of the extracted non-taxonomic relations is also far below that of typical taxonomic relation extraction.

Using distant supervision for non-taxonomic relation extraction tends to produce a lot of label noise. Zeng, D. et al., in 《Distant Supervision for Relation Extraction via Piecewise Convolutional Neural Networks》, remove noise with multi-instance learning; Takamatsu, S. et al., in 《Reducing wrong labels in distant supervision for relation extraction》, remove label noise with high-quality patterns.

However, the clustering algorithms that most distantly supervised relation recognition methods use to remove label noise do not take full account of the syntactic and semantic information carried by the word vectors in the vector space. In the entries that online encyclopedias and vegetable websites write about vegetable varieties, contextual information is important and strongly affects relation extraction. How to provide a more accurate non-taxonomic relation extraction method suited to the vegetable domain is therefore a technical problem in urgent need of a solution.

Summary of the invention

In view of the defects of the prior art, embodiments of the present invention provide a distantly supervised non-taxonomic relation extraction method and system based on word2vec.

In one aspect, an embodiment of the present invention proposes a distantly supervised non-taxonomic relation extraction method based on word2vec, including:

S1. Crawl unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, then preprocess the corpus and perform data alignment, in that order, to obtain a preliminary training corpus;

In this embodiment, preprocessing and data alignment specifically mean applying word segmentation, part-of-speech tagging and similar processing to the corpus in turn, and then aligning the result with the data in a knowledge base.

S2. Train a word2vec model on the preliminary training corpus and use the model to convert the words in each sentence of the preliminary training corpus into vectors; for each sentence, sum the vectors of its words and take the average to obtain the vector of that sentence;

S3. Aggregate the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extract a common sentence pattern and an uncommon sentence pattern;

S4. Set k to 2, heuristically select two sentence vectors that match the two different patterns, one for each, as the initial centers for k-means clustering, cluster all sentence vectors, and keep the cluster that matches the common sentence pattern to obtain a higher-quality training corpus;

S5. Train a convolutional neural network model on the higher-quality training corpus and extract non-taxonomic relations from the sentence vectors through the one convolutional layer, one pooling layer and one fully connected softmax layer that make up the model.

In another aspect, an embodiment of the present invention proposes a distantly supervised non-taxonomic relation extraction system based on word2vec, including:

An acquisition unit, configured to crawl unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, then preprocess the corpus and perform data alignment, in that order, to obtain a preliminary training corpus;

A training unit, configured to train a word2vec model on the preliminary training corpus and use the model to convert the words in each sentence of the preliminary training corpus into vectors, and, for each sentence, to sum the vectors of its words and take the average to obtain the vector of that sentence;

An aggregation unit, configured to aggregate the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extract a common sentence pattern and an uncommon sentence pattern;

A clustering unit, configured to set k to 2, heuristically select two sentence vectors that match the two different patterns, one for each, as the initial centers for k-means clustering, cluster all sentence vectors, and keep the cluster that matches the common sentence pattern to obtain a higher-quality training corpus;

An extraction unit, configured to train a convolutional neural network model on the higher-quality training corpus and extract non-taxonomic relations from the sentence vectors through the one convolutional layer, one pooling layer and one fully connected softmax layer that make up the model.

The distantly supervised non-taxonomic relation extraction method and system based on word2vec proposed by the embodiments of the present invention take unstructured vegetable-domain web text as the corpus, train on the corpus with the word2vec tool, reduce label noise with a clustering algorithm, and finally extract non-taxonomic relations with a convolutional neural network model. The word2vec tool trains word vectors efficiently, and the resulting word vectors capture syntactic and semantic information, so the sentences grouped by the clustering algorithm also carry syntactic and semantic information, which effectively supports the removal of label noise under distant supervision. In addition, extracting non-taxonomic relations with a convolutional neural network model avoids the error accumulation that arises in multi-stage natural language processing pipelines. Compared with prior art that does not take full account of the syntactic and semantic information carried by the word vectors in the vector space, the present invention is therefore better suited to the vegetable domain and extracts non-taxonomic relations more accurately.

Brief description of the drawings

Fig. 1 is a schematic flowchart of an embodiment of the distantly supervised non-taxonomic relation extraction method based on word2vec according to the present invention;

Fig. 2 is a schematic structural diagram of an embodiment of the distantly supervised non-taxonomic relation extraction system based on word2vec according to the present invention.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly below with reference to the accompanying drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative work fall within the scope of protection of the present invention.

Referring to Fig. 1, this embodiment discloses a distantly supervised non-taxonomic relation extraction method based on word2vec, comprising steps S1 to S5 set out above. The individual steps are detailed below.

Step S1 may include:

S10. Use a purpose-written corpus collection script to crawl unstructured text from online vegetable encyclopedias and large vegetable websites as the corpus, and preprocess the corpus with low-frequency word filtering, word segmentation, part-of-speech tagging and similar steps;

S11. Align the corpus obtained in step S10 with the relation instances in a preset knowledge base to obtain a preliminary training corpus.

This step rests on the following assumption: if a semantic relation holds between two concepts, then every sentence containing both entity concepts also expresses that relation.

For example, for the non-taxonomic relation <health-care effect, tomato, stomach>, this assumption picks out of the text collection all sentences that contain both "tomato" and "stomach", such as sentences I and II:

I. "Tomatoes strengthen the stomach and aid digestion";

II. "Eating tomatoes on an empty stomach often causes stomach discomfort and bloating".

The non-taxonomic relation instance and these sentences together form an aligned data item. Sentence II, however, does not express the "health-care effect" relation and is therefore noise. The steps below remove this label noise and extract the non-taxonomic relations of the vegetable domain.
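The alignment assumption in step S11 can be illustrated with a minimal Python sketch. The `Triple` type, the `align` function and the English stand-in sentences are hypothetical names and data introduced only for this illustration; the patent works on segmented Chinese text and does not specify a matching procedure, so plain substring matching is an assumption.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """A knowledge-base relation instance (hypothetical structure)."""
    relation: str
    head: str
    tail: str


def align(sentences, knowledge_base):
    """Pair each triple with every sentence that mentions both of its entities.

    This applies the distant-supervision assumption literally, so it also
    returns noisy pairs such as sentence II in the text above.
    """
    aligned = []
    for triple in knowledge_base:
        for sentence in sentences:
            text = sentence.lower()
            # Assumption: simple substring matching stands in for entity matching.
            if triple.head in text and triple.tail in text:
                aligned.append((triple, sentence))
    return aligned


kb = [Triple("health-care effect", "tomato", "stomach")]
corpus = [
    "Tomato strengthens the stomach and aids digestion.",
    "Eating tomato on an empty stomach often causes stomach discomfort.",
    "Eggplant growth requires higher temperature.",
]
aligned = align(corpus, kb)
```

Both tomato sentences end up labeled with "health-care effect", which is exactly the label noise that the later clustering step is meant to reduce.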

The word2vec used in step S2 is a word-vector training tool open-sourced by Google. Given a corpus, its optimized training models quickly and efficiently map each word in a sentence to a real-valued vector in a k-dimensional space, and these vectors capture syntactic and semantic features. Its core architectures are CBOW and Skip-gram.

Put simply, the CBOW model uses the context to determine the probability of the current word. The present invention uses the Skip-gram model, which uses the current word to predict the probability of its context. When a corpus is processed, the size of the processing window often prevents the relation between the current word and words outside the window from being reflected correctly in the model, while simply enlarging the window increases training complexity. The Skip-gram model addresses this problem by "skipping some words". For example, the online encyclopedia entry "eggplant growth requires higher temperature" yields two 4-grams, "eggplant growth requires higher" and "growth requires higher temperature", neither of which expresses the original meaning of the sentence. The Skip-gram model allows some words to be skipped: with up to two words skipped, the 4-grams "eggplant requires higher temperature" and "eggplant growth higher temperature" become available, and these do express the original meaning. The specific steps for using the word2vec tool are as follows:

(1) Train a word2vec model on the preliminary training corpus;

(2) Use the word2vec model to obtain the vector of each word in the corpus sentences; these word vectors carry syntactic and semantic information. Sum the vectors of all the words in each sentence and average them to obtain the vector of that sentence. For example, for the sentence "kidney beans are rich in protein and carotene and have high nutritional value", the trained word2vec model provides vectors for "kidney beans", "rich in", "protein", "carotene", "nutrition", "value" and "high"; summing these word vectors and averaging them gives the vector of the whole sentence.
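The averaging in step (2) can be sketched as follows. The three-dimensional `toy_vectors` table is made-up data standing in for a trained word2vec model, and skipping tokens that have no vector is an assumption of this sketch, not something the patent states.

```python
def sentence_vector(tokens, word_vectors):
    """Sum the word vectors of a sentence and divide by the number of words used."""
    # Assumption: tokens without a vector (e.g. filtered low-frequency words) are skipped.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token in the sentence has a word vector")
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Toy 3-dimensional embeddings; real word2vec vectors have far more dimensions.
toy_vectors = {
    "kidney beans": [1.0, 0.0, 2.0],
    "rich in": [0.0, 1.0, 0.0],
    "protein": [2.0, 2.0, 1.0],
}
tokens = ["kidney beans", "rich in", "protein", "unseen-word"]
vec = sentence_vector(tokens, toy_vectors)
```

Averaging keeps every sentence vector at the same dimensionality as the word vectors regardless of sentence length, which is what lets the later k-means step compare sentences directly.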

Step S3 may include:

S30. Aggregate the preliminary training corpus by the non-taxonomic relation type that each sentence carries. For the aggregated data of each relation, find sentence patterns with the DL-CoTrain algorithm and extract one common sentence pattern and one uncommon sentence pattern, that is, select the pattern for which the score h(x) = (count(x) + a) / (N + k·a) is high, where k is the number of classes (2), a is a smoothing parameter (usually 0.1), count(x) is the number of times feature x occurs, and N is the number of aligned data items for one non-taxonomic relation;
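The score in step S30 can be computed as in the sketch below. It covers only the smoothed score h(x), not the DL-CoTrain procedure itself. The pattern strings are invented examples, and taking the lowest-scoring pattern as the "uncommon" one is an assumption made for illustration.

```python
from collections import Counter


def pattern_scores(observed_patterns, k=2, a=0.1):
    """Return h(x) = (count(x) + a) / (N + k * a) for every pattern x.

    observed_patterns holds one pattern per aligned data item of a single
    relation, so N is simply its length.
    """
    n = len(observed_patterns)
    counts = Counter(observed_patterns)
    return {p: (c + a) / (n + k * a) for p, c in counts.items()}


# Hypothetical patterns for the "health-care effect" relation.
observed = ["X has the effect of Y"] * 8 + ["eating X causes Y discomfort"] * 2
scores = pattern_scores(observed)
ranked = sorted(scores, key=scores.get, reverse=True)
common, uncommon = ranked[0], ranked[-1]
```

With these counts the common pattern scores 8.1 / 10.2 and the other 2.1 / 10.2; the smoothing term a keeps the score away from 0 and 1 for rarely seen patterns.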

Step S4 may include:

S40. Select two sentences that match the two different patterns as the initial centers of the two classes;

S41. Set k to 2, cluster all sentences that match either of the two sentence patterns with the k-means clustering algorithm, and keep the cluster that matches the common sentence pattern. Because this process operates on word vectors that carry syntactic and semantic information, the resulting sentences also carry syntactic and semantic information, so label noise can be removed effectively and a higher-quality training corpus obtained;
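Steps S40 and S41 amount to k-means with k = 2 and hand-picked initial centers. The following is a minimal sketch under that reading; the function names and the two-dimensional toy vectors are hypothetical, and Euclidean distance is an assumption, since the patent does not name a distance measure.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def seeded_two_means(vectors, common_seed, uncommon_seed, max_iter=100):
    """k-means with k=2; cluster 0 starts at the common-pattern sentence."""
    centers = [list(common_seed), list(uncommon_seed)]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            0 if math.dist(v, centers[0]) <= math.dist(v, centers[1]) else 1
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in (0, 1):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = mean(members)
    return labels, centers


# Toy sentence vectors: the first three resemble the common-pattern seed.
vectors = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
labels, _ = seeded_two_means(vectors, common_seed=vectors[1], uncommon_seed=vectors[4])
kept = [v for v, lab in zip(vectors, labels) if lab == 0]
```

Seeding cluster 0 with the common-pattern sentence means the cluster to keep is known in advance, so no separate step is needed to decide which of the two clusters is the clean one.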

S5. Train a convolutional neural network model on the higher-quality training corpus and extract non-taxonomic relations from the sentence vectors through the fully connected softmax layer of the model.

Step S5 may include:

S50. Train a convolutional neural network model on the higher-quality training corpus: feed the sentence vectors into the network, where the convolutional layer automatically extracts text features, the pooling layer down-samples them, and the fully connected layer outputs the predicted probability of each non-taxonomic relation. The convolutional neural network model consists of one convolutional layer, one pooling layer and one fully connected softmax layer.

It will be understood that the network structure comprises one convolutional layer, one pooling layer and one fully connected softmax layer. The convolutional layer automatically extracts multiple sentence feature values, and max pooling selects the most important sentence features at a fixed length. The sentence feature vectors produced by the convolution are then concatenated into a new sentence feature vector that integrates all the features, which is passed as a single feature vector into the fully connected softmax layer; this layer outputs the probability distribution over the non-taxonomic relations.
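The forward pass through the three layers can be sketched in plain Python as below. This is an illustration only: the weights are random and untrained, training is omitted, the ReLU activation and all sizes are assumptions, and convolving over the dimensions of a single sentence vector is one literal reading of the text rather than a confirmed detail of the patented network.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation) followed by ReLU."""
    w = len(kernel)
    return [
        max(0.0, sum(signal[i + j] * kernel[j] for j in range(w)))
        for i in range(len(signal) - w + 1)
    ]


def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward(sentence_vec, kernels, fc_weights, fc_bias):
    # Convolutional layer: one feature map per kernel.
    feature_maps = [conv1d(sentence_vec, k) for k in kernels]
    # Pooling layer: max pooling keeps one value per kernel, so the
    # concatenated feature vector has a fixed length.
    pooled = [max(fm) for fm in feature_maps]
    # Fully connected softmax layer: one probability per relation type.
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(fc_weights, fc_bias)
    ]
    return softmax(logits)


random.seed(0)
dim, n_kernels, kernel_width, n_relations = 8, 4, 3, 3  # hypothetical sizes
sentence_vec = [random.uniform(-1, 1) for _ in range(dim)]
kernels = [[random.uniform(-1, 1) for _ in range(kernel_width)] for _ in range(n_kernels)]
fc_weights = [[random.uniform(-1, 1) for _ in range(n_kernels)] for _ in range(n_relations)]
fc_bias = [0.0] * n_relations
probs = forward(sentence_vec, kernels, fc_weights, fc_bias)
```

The output `probs` has one entry per relation type and sums to 1; in a trained model the relation with the highest probability would be taken as the extracted relation.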

The distantly supervised non-taxonomic relation extraction method based on word2vec proposed in this embodiment takes unstructured vegetable-domain web text as the corpus, trains on the corpus with the word2vec tool, reduces label noise with a clustering algorithm, and finally extracts non-taxonomic relations with a convolutional neural network model. The word2vec tool trains word vectors efficiently, and the resulting word vectors capture syntactic and semantic information, so the sentences grouped by the clustering algorithm also carry syntactic and semantic information, which effectively supports the removal of label noise under distant supervision. In addition, extracting non-taxonomic relations with a convolutional neural network model avoids the error accumulation that arises in multi-stage natural language processing pipelines. Compared with prior art that does not take full account of the syntactic and semantic information carried by the word vectors in the vector space, the present invention is therefore better suited to the vegetable domain and extracts non-taxonomic relations more accurately.

Referring to Fig. 2, this embodiment discloses a distantly supervised non-taxonomic relation extraction system based on word2vec, including:

Acquisition unit 1, configured to crawl unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, then preprocess the corpus and perform data alignment, in that order, to obtain a preliminary training corpus;

In this embodiment, the acquisition unit may include:

A crawling subunit, configured to use a purpose-written corpus collection script to crawl unstructured text from online vegetable encyclopedias and large vegetable websites as the corpus, and to preprocess the corpus with low-frequency word filtering, word segmentation, part-of-speech tagging and similar steps;

An alignment subunit, configured to align the corpus obtained by the crawling subunit with the relation instances in a preset knowledge base to obtain a preliminary training corpus.

Training unit 2, configured to train a word2vec model on the preliminary training corpus and use the model to convert the words in each sentence of the preliminary training corpus into vectors, and, for each sentence, to sum the vectors of its words and take the average to obtain the vector of that sentence;

Aggregation unit 3, configured to aggregate the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extract a common sentence pattern and an uncommon sentence pattern;

The aggregation unit may specifically be configured to:

Aggregate the preliminary training corpus by the non-taxonomic relation type that each sentence carries and, for the aggregated data of each relation, find sentence patterns with the DL-CoTrain algorithm and extract one common sentence pattern and one uncommon sentence pattern.

Clustering unit 4, configured to set k to 2, heuristically select two sentence vectors that match the two different patterns, one for each, as the initial centers for k-means clustering, cluster all sentence vectors, and keep the cluster that matches the common sentence pattern to obtain a higher-quality training corpus;

Extraction unit 5, configured to train a convolutional neural network model on the higher-quality training corpus and extract non-taxonomic relations from the sentence vectors through the one convolutional layer, one pooling layer and one fully connected softmax layer that make up the model.

The extraction unit may specifically be configured to:

Train a convolutional neural network model on the higher-quality training corpus: feed the sentence vectors into the network, where the convolutional layer automatically extracts text features, the pooling layer down-samples them, and the fully connected layer outputs the predicted probability of each non-taxonomic relation. The convolutional neural network model consists of one convolutional layer, one pooling layer and one fully connected softmax layer.

The distantly supervised non-taxonomic relation extraction system based on word2vec proposed in this embodiment takes unstructured vegetable-domain web text as the corpus, trains on the corpus with the word2vec tool, reduces label noise with a clustering algorithm, and finally extracts non-taxonomic relations with a convolutional neural network model. The word2vec tool trains word vectors efficiently, and the resulting word vectors capture syntactic and semantic information, so the sentences grouped by the clustering algorithm also carry syntactic and semantic information, which effectively supports the removal of label noise under distant supervision. In addition, extracting non-taxonomic relations with a convolutional neural network model avoids the error accumulation that arises in multi-stage natural language processing pipelines. Compared with prior art that does not take full account of the syntactic and semantic information carried by the word vectors in the vector space, the present invention is therefore better suited to the vegetable domain and extracts non-taxonomic relations more accurately.

The invention has the following advantages:

In terms of application, the invention is dedicated to extracting non-taxonomic relations in the vegetable domain. Non-taxonomic relations can substantially improve the precision and recall of information queries over the large volume of vegetable-domain information and make knowledge representation more complete. This makes possible intelligent semantic information services that let people obtain the vegetable information they need quickly and accurately, and raises the level of vegetable information services.

Although embodiments of the present invention have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and all such modifications and variations fall within the scope defined by the appended claims.

Claims

1. A distantly supervised non-taxonomic relation extraction method based on word2vec, characterized by comprising:

S1. crawling unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, then preprocessing the corpus and performing data alignment, in that order, to obtain a preliminary training corpus;

S2. training a word2vec model on the preliminary training corpus and using the model to convert the words in each sentence of the preliminary training corpus into vectors, and, for each sentence, summing the vectors of its words and taking the average to obtain the vector of that sentence;

S3. aggregating the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extracting a common sentence pattern and an uncommon sentence pattern;

S4. setting k to 2, heuristically selecting two sentence vectors that match the two different patterns, one for each, as the initial centers for k-means clustering, clustering all sentence vectors, and keeping the cluster that matches the common sentence pattern to obtain a higher-quality training corpus;

S5. training a convolutional neural network model on the higher-quality training corpus and extracting non-taxonomic relations from the sentence vectors through the one convolutional layer, one pooling layer and one fully connected softmax layer that make up the model.

2. The distantly supervised non-taxonomic relation extraction method based on word2vec according to claim 1, characterized in that step S1 comprises:

S10. using a purpose-written corpus collection script to crawl unstructured text from online vegetable encyclopedias and large vegetable websites as the corpus, and preprocessing the corpus with low-frequency word filtering, word segmentation, part-of-speech tagging and similar steps;

S11. aligning the corpus obtained in step S10 with the relation instances in a preset knowledge base to obtain a preliminary training corpus.

3. The distantly supervised non-taxonomic relation extraction method based on word2vec according to claim 2, characterized in that step S3 comprises:

S30. aggregating the preliminary training corpus by the non-taxonomic relation type that each sentence carries and, for the aggregated data of each relation, finding sentence patterns with the DL-CoTrain algorithm and extracting one common sentence pattern and one uncommon sentence pattern.

4. The distantly supervised non-taxonomic relation extraction method based on word2vec according to claim 3, characterized in that step S5 comprises:

S50. training a convolutional neural network model on the higher-quality training corpus, wherein the sentence vectors are fed into the network, the convolutional layer automatically extracts text features, the pooling layer down-samples them, and the fully connected layer outputs the predicted probability of each non-taxonomic relation, and wherein the convolutional neural network model consists of one convolutional layer, one pooling layer and one fully connected softmax layer.

5. A distantly supervised non-taxonomic relation extraction system based on word2vec, characterized by comprising:

an acquisition unit, configured to crawl unstructured vegetable-domain text from online encyclopedias and large vegetable websites as a corpus, then preprocess the corpus and perform data alignment, in that order, to obtain a preliminary training corpus;

a training unit, configured to train a word2vec model on the preliminary training corpus and use the model to convert the words in each sentence of the preliminary training corpus into vectors, and, for each sentence, to sum the vectors of its words and take the average to obtain the vector of that sentence;

an aggregation unit, configured to aggregate the preliminary training corpus by non-taxonomic relation type and, for the aggregated data of each relation, extract a common sentence pattern and an uncommon sentence pattern;

a clustering unit, configured to set k to 2, heuristically select two sentence vectors that match the two different patterns, one for each, as the initial centers for k-means clustering, cluster all sentence vectors, and keep the cluster that matches the common sentence pattern to obtain a higher-quality training corpus;

an extraction unit, configured to train a convolutional neural network model on the higher-quality training corpus and extract non-taxonomic relations from the sentence vectors through the one convolutional layer, one pooling layer and one fully connected softmax layer that make up the model.

6. The distantly supervised non-taxonomic relation extraction system based on word2vec according to claim 5, characterized in that the acquisition unit comprises:

a crawling subunit, configured to use a purpose-written corpus collection script to crawl unstructured text from online vegetable encyclopedias and large vegetable websites as the corpus, and to preprocess the corpus with low-frequency word filtering, word segmentation, part-of-speech tagging and similar steps;

an alignment subunit, configured to align the corpus obtained by the crawling subunit with the relation instances in a preset knowledge base to obtain a preliminary training corpus.

7. The distantly supervised non-taxonomic relation extraction system based on word2vec according to claim 6, characterized in that the aggregation unit is specifically configured to:

aggregate the preliminary training corpus by the non-taxonomic relation type that each sentence carries and, for the aggregated data of each relation, find sentence patterns with the DL-CoTrain algorithm and extract one common sentence pattern and one uncommon sentence pattern.

8. The distantly supervised non-taxonomic relation extraction system based on word2vec according to claim 7, characterized in that the extraction unit is specifically configured to:

train a convolutional neural network model on the higher-quality training corpus, wherein the sentence vectors are fed into the network, the convolutional layer automatically extracts text features, the pooling layer down-samples them, and the fully connected layer outputs the predicted probability of each non-taxonomic relation, and wherein the convolutional neural network model consists of one convolutional layer, one pooling layer and one fully connected softmax layer.