CN104199972B

CN104199972B - A named entity relation extraction and construction method based on deep learning

Info

Publication number: CN104199972B
Application number: CN201410488047.7A
Authority: CN
Inventors: 袁伟; 邓攀; 闫碧莹; 赵鑫; 李玉成; 余雷
Original assignee: Zhong Kjia Speed (beijing) Information Technology Co Ltd
Current assignee: Zhong kjia speed (Beijing) Information Technology Co., Ltd.
Priority date: 2013-09-22
Filing date: 2014-09-22
Publication date: 2018-08-03
Anticipated expiration: 2034-09-22
Also published as: CN104199972A

Abstract

The present invention provides a deep-learning-based method for extracting named entity relations and constructing an entity relation network, for use in the field of Internet information technology. For a given domain, the method crawls news data from vertical websites in that domain and preprocesses the collected data. It segments the news data into words, extracts keywords, generates an industry dictionary, and re-segments the news data using that dictionary. It then extracts a seed dictionary and builds an entity relation network without supervision: sentences containing two or more entities are extracted from the news data, the verbs in those sentences and the corresponding documents are extracted, a deep-learning-based word clustering model is built on the extracted documents, and the entity relation network is constructed from the relations between words described by the verbs. Finally, entity relation categories are defined and every entity pair in the network is assigned a relation category. The invention does not require large-scale manual annotation of sample data, depends only weakly on a corpus, and achieves high entity relation extraction performance.

Description

A named entity relation extraction and construction method based on deep learning

Technical field

The present invention relates to the field of Internet information technology, and in particular to a method for extracting named entity relations.

Background art

In the field of information research, information extraction is an essential key technology. Faced with a massive information space, extracting the content users care about quickly and accurately is an urgent problem and an important research direction in information mining. Unlike information processing techniques such as information retrieval, information extraction must recognize the named entities in a text and extract the relations between them. In Chinese text, words are flexible, word formation is complex, and there are no explicit markers, which makes recognizing Chinese named entities and extracting their relations more difficult.

At present there are two main approaches to information extraction. The first is the knowledge-base approach, which requires a set of hand-written rules. Its accuracy is relatively high, but the rules are hard to specify, place high demands on their author, and port poorly. The second is statistical machine learning, which uses various models trained on a manually annotated training set; the model then computes the relevant probabilities for new data, and these yield the final result. This approach costs less, performs better, and is easy to port, so it is the focus of current research.

Machine-learning-based entity relation extraction mainly comprises supervised and weakly supervised methods. The supervised workflow is generally: preprocess the training text, manually annotate related word pairs and their relations, extract features and vectorize them, train a model with a classification algorithm, and use the model to label relation categories. Weakly supervised methods differ from supervised ones mainly in how much they depend on annotated corpora. A weakly supervised method uses a small annotated corpus and a bootstrapping (self-learning) framework, combined with various classification algorithms, to extract entity relations.

Weakly supervised methods perform poorly because they use only a small annotated corpus. Supervised methods rely on a large annotated corpus, which must be annotated manually according to the task at hand. This consumes a great deal of manpower and resources, and because the performance of a model trained on such a corpus cannot be estimated accurately in advance, the investment carries considerable risk.

Summary of the invention

To address the problems in existing entity relation extraction technology of acquiring domain-specific index data sets, acquiring patterns, and resolving coreference, the present invention provides a deep-learning-based method for named entity relation extraction and construction.

The method provided by the present invention targets a specific domain and includes the following steps:

Step 1: Build a crawler and crawl news data from vertical websites in the domain;

Step 2: Preprocess the collected news data to remove junk information, including duplicate information, abnormally displayed information, and garbled-encoding information;

Step 3: Segment the news data into words, extract keywords, and add the extracted keywords to a dictionary to generate an industry dictionary;

Step 4: Segment the news data again using the industry dictionary to obtain the corresponding word set;

Step 5: Extract a seed dictionary, where each seed is a predefined entity pair;

Step 6: Build the entity relation network without supervision. Specifically: extract from the news data the sentences that contain two or more entities, and extract the verbs in those sentences and the corresponding documents; build a deep-learning-based entity word clustering model on the extracted documents to obtain the probability distribution of each entity word over other words; construct the entity relation network from the relations between words described by the verbs;

Step 7: Define entity relation categories. Specifically: extract from the news data the verbs in sentences that contain two entities, and cluster the verbs, assigning identical verbs to the same class;

Step 8: Classify the entity relations. Specifically: assign a relation category to each entity pair in the entity relation network based on the clustering result of step 7.

Compared with the prior art, the named entity relation extraction and construction method of the present invention has the following advantages and positive effects:

1. It uses unsupervised entity relation extraction, so no large-scale manual annotation of sample data is needed;

2. Its dependence on a corpus is low; using ordinary news from the field as the text improves the performance of entity relation extraction;

3. It extracts named entities domain by domain, which reduces interference from irrelevant information across domains, so the extraction results are highly accurate.

Brief description of the drawings

Fig. 1 is an overall flow chart of the domain-specific named entity extraction and construction method according to an embodiment of the present invention;

Fig. 2 is an overall flow chart of the domain-specific industry keyword extraction method according to an embodiment of the present invention;

Fig. 3 is a flow chart of domain-specific named entity extraction according to an embodiment of the present invention;

Fig. 4 is a flow chart of domain-specific relation template extraction according to an embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to the drawings and an embodiment.

The embodiment illustrates the deep-learning-based named entity relation extraction and construction method using the automobile domain. It includes: segmenting a collection of automotive news texts; using a self-learning bootstrap method to extract entity pairs (automobile brand, automobile model) from the segmented units, and selecting a small number of examples as the initial seed set; using the bootstrap method to extract relation templates from the entities; and using deep learning to build the relations between entities, clustering/classifying the relation templates to obtain relation categories.

As shown in Fig. 1, the deep-learning-based named entity extraction and construction method for a specific domain according to the embodiment includes the following steps:

Step 1: Build a crawler and crawl news data from vertical websites. The embodiment mainly uses data from Autohome (汽车之家) and Pacific Auto (太平洋汽车). Step 1 is divided into steps 101 to 102.

Step 101: Build a distributed crawler and crawl pages from the vertical websites.

Step 102: Build the DOM tree of each crawled HTML page and extract the text contained in the page according to its tags.
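As a minimal sketch of the tag-based text extraction in step 102, the following uses Python's standard `html.parser` module. The class name `TagTextExtractor`, the function `extract_text`, and the choice of `<p>` and `<h1>` as the tags of interest are assumptions made for illustration; the source does not say which tags it uses or how the DOM tree is traversed.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text found inside a chosen set of tags (assumed here: <p> and <h1>)."""

    def __init__(self, wanted_tags=("p", "h1")):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.depth = 0      # greater than 0 while inside a wanted tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and text:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of all wanted tags, one chunk per line."""
    parser = TagTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Text outside the chosen tags, such as script content, is skipped because it is never seen while `depth` is positive.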

Step 2: Preprocess the collected news data. Step 2 is divided into steps 201 to 202.

Step 201: Clean the data according to news length, and remove junk news using regular expressions and a predefined rule set.
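The cleaning in step 201 might look roughly like the sketch below. The rule set `JUNK_PATTERNS`, the function `clean_news`, and the `min_length` default are all hypothetical; the source does not list its actual rules or length limits.

```python
import re

# Hypothetical rule set; the source does not list its actual rules.
JUNK_PATTERNS = [
    re.compile(r"<[^>]+>"),   # leftover HTML tags (abnormal display)
    re.compile("\ufffd"),     # Unicode replacement character (garbled encoding)
]


def clean_news(items, min_length=20):
    """Drop items that are too short or that match any junk pattern."""
    kept = []
    for text in items:
        text = text.strip()
        if len(text) < min_length:
            continue
        if any(pattern.search(text) for pattern in JUNK_PATTERNS):
            continue
        kept.append(text)
    return kept
```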

Step 202: Filter the news data with a Bloom filter to remove duplicate news. Each news item is first mapped into a bit array using N hash functions. For every subsequent item, its N hash values are computed and checked against the bit array to decide whether the item has already been seen. If all of the positions given by its hash values are already set in the bit array, the item is treated as already present and is filtered out.
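The duplicate filter of step 202 can be sketched as follows. The bit array size, the number of hash functions, and the trick of deriving N hash functions by salting SHA-256 with an index are assumptions; the source only states that N hash functions and a bit array are used. A Bloom filter can report false positives, so a small fraction of new items may be discarded as duplicates.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, text):
        # Derive N positions by salting one hash function with the index i (assumed scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{text}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        # Present only if every one of the N positions is already set.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(text))


def deduplicate(items):
    """Keep the first occurrence of each item, in order."""
    seen = BloomFilter()
    unique = []
    for item in items:
        if item in seen:
            continue  # probably a duplicate; false positives are possible
        seen.add(item)
        unique.append(item)
    return unique
```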

Step 3: Extract keywords to form a new industry dictionary. The invention extracts keywords with an N-gram model and adds them to an existing base dictionary to generate the new industry dictionary.

Unlike languages written in the Latin alphabet such as English, Chinese text has no explicit separators such as spaces, so the first step in processing Chinese text is word segmentation. Information extraction also requires the segmented text to be annotated. The invention uses ICTCLAS for Chinese word segmentation and mines an automobile industry dictionary with keyword mining techniques to improve segmentation accuracy. Keywords in the invention are judged on both distinctiveness and information content, with more weight on information content.

The cohesion measure PMI is defined in the embodiment as follows:

PMI(a, b) = p(ab) / (p(a) p(b))

The PMI value measures the cohesion of word a and word b as a candidate keyword ab and is used to extract keywords. Here p(a) is the frequency of word a, p(b) is the frequency of word b, and p(ab) is the frequency of the combination ab. PMI has a typical weakness: it tends to favor low-frequency words. The implementation therefore keeps only words whose frequency exceeds a threshold as candidates and discards low-frequency words. Experiments show that extracting keywords with this cohesion measure removes more noise than other existing methods. The PMI value is also known as the pointwise mutual information value.
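A minimal sketch of the cohesion score for adjacent word pairs, using the ratio form given above (without a logarithm). The function name `pmi_keywords`, the `min_count` frequency cutoff, and the use of relative frequencies over the token sequence as p(a), p(b), and p(ab) are assumptions.

```python
from collections import Counter


def pmi_keywords(tokens, min_count=2, threshold=1.0):
    """Score adjacent word pairs with PMI(a, b) = p(ab) / (p(a) * p(b))."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # discard low-frequency candidates, which PMI tends to overrate
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        score = p_ab / (p_a * p_b)
        if score > threshold:
            scores[(a, b)] = score
    return scores
```

In the six-token sequence `new energy car new energy policy`, only the pair (`new`, `energy`) occurs twice, so it is the only candidate that survives the frequency cutoff.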

Step 3 is divided into steps 301 to 305, as shown in Fig. 2.

Step 301: Call the Chinese word segmentation program to perform a preliminary segmentation of the news data.

Step 302: Using 1-grams, compute the PMI value of each word and select words whose PMI value exceeds threshold A as keywords.

Step 303: Using 2-grams, compute the PMI value of each word and select words whose PMI value exceeds threshold B as keywords.

Step 304: Using 3-grams, compute the PMI value of each word and select words whose PMI value exceeds threshold C as keywords.

Step 305: Merge the keywords obtained in steps 302 to 304 with the original dictionary to form the dictionary used for re-segmentation.

Thresholds A, B, and C can be determined experimentally.

Step 4: Using the industry dictionary obtained in step 3, segment the news data again to obtain the corresponding word set. This step segments all of the news data, removes stop words, and produces the segmentation result.

Step 4 includes steps 401 to 402.

Step 401: First, call the Chinese word segmentation program to segment the text. Then remove stop words according to a stop word list, and normalize the morphology of any English words so that they share a unified form.

After segmentation and annotation, a text is represented as a sequence of annotated words. Many of these are stop words, which carry no meaning for information extraction. The invention removes them using a stop word list. This reduces the system's computation and improves the accuracy of the subsequent information extraction. When removing stop words, words are simply ranked by term frequency and document frequency, and the most frequent words are removed.

Step 402: Count the document frequency df and term frequency tf of each word and compute its inverse document frequency idf. The weight of each word is computed as log(tf*(idf+1)+1) and compared with a weight threshold D; words whose weight exceeds D are kept. This yields a word set that reflects the characteristics of the news, and the thresholding also reduces the dimension of the word set for the news data.
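The weighting in step 402 can be illustrated with the sketch below. The source gives the weight formula but not how tf and idf are defined, so this sketch assumes tf is the corpus-wide count of a word and idf = log(number of documents / df). The function name `select_words` is also hypothetical.

```python
import math
from collections import Counter


def select_words(documents, threshold):
    """documents: list of token lists. Return {word: weight} for weights above threshold."""
    n_docs = len(documents)
    tf = Counter(word for doc in documents for word in doc)
    df = Counter(word for doc in documents for word in set(doc))
    weights = {}
    for word, freq in tf.items():
        idf = math.log(n_docs / df[word])                # assumed idf definition
        weights[word] = math.log(freq * (idf + 1) + 1)   # formula from step 402
    return {word: w for word, w in weights.items() if w > threshold}
```

Under these assumptions a word that occurs once in every one of three documents gets idf = 0 and weight log(3 + 1) = log 4.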

Step 5: Manually create a seed dictionary of automobile brands and automobile models, then mine a dictionary of automobile brands and models by bootstrapping.

After the text collection has been segmented, annotated, and filtered for stop words, the scope of extraction is restricted to a suitable range in order to improve accuracy. Sentences in which two named entities occur together must be found, and named entity pairs are located within a set context window. Below, a named entity is called an entity and a named entity pair is called an entity pair. In the invention an entity pair is <automobile brand, automobile model>. In the embodiment, an automobile brand is one entity and an automobile model is another; "entity" below refers to either.

To extract relations between entities automatically, a relation seed set must be provided in advance. A small seed set can be supplied manually, but on its own it is not sufficient for information extraction. The seed set is therefore extended by bootstrapping, an automatic training method.

The relation between the entities of a pair can be judged from the context between them: two entity pairs with the same or similar contexts have the same or similar relations. The similarity between an entity pair and a relation seed can therefore be computed as the similarity of their context vectors.
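One way to compare the contexts of two entity pairs is sketched below. The source does not specify how the context vector is built or which similarity measure is used; this sketch assumes a bag-of-words vector over the tokens between the two entities and cosine similarity. The function names are hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, first, second):
    """Bag-of-words vector of the tokens strictly between two entities."""
    i, j = tokens.index(first), tokens.index(second)
    lo, hi = sorted((i, j))
    return Counter(tokens[lo + 1:hi])


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0
```

A candidate pair whose context shares the verb of a seed pair scores above zero, while a pair with no shared context words scores zero.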

This step includes steps 501 and 502, as shown in Fig. 3.

Step 501: Manually select automobile brands and their corresponding models. Provide a number of seeds and seed extraction templates; each seed is an entity pair. The number can be set as needed. An example seed extraction template is: (an automobile brand) releases (an automobile model).

Step 502: Mine entity pairs by bootstrapping. Automatically mining the relations between entities with the bootstrap method continually yields new seed extraction templates, and these templates are in turn used to extract new seeds, iteratively.
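The alternation between seeds and templates in step 502 can be sketched as follows. This is an illustrative simplification, not the patent's own pseudocode: a template is assumed to be the exact token sequence between a known brand and its model, and any two tokens surrounding a known template are accepted as a new pair without any scoring. The function name `bootstrap` and the `rounds` limit are hypothetical.

```python
def bootstrap(sentences, seeds, rounds=3):
    """sentences: list of token lists; seeds: set of (brand, model) pairs."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(rounds):
        # 1. Learn templates: the tokens between a known brand and its model.
        for tokens in sentences:
            for brand, model in pairs:
                if brand in tokens and model in tokens:
                    i, j = tokens.index(brand), tokens.index(model)
                    if j - i >= 2:
                        patterns.add(tuple(tokens[i + 1:j]))
        # 2. Apply templates: any "X <template> Y" yields a candidate pair (X, Y).
        found = set()
        for tokens in sentences:
            for pattern in patterns:
                k = len(pattern)
                for start in range(1, len(tokens) - k):
                    if tuple(tokens[start:start + k]) == pattern:
                        found.add((tokens[start - 1], tokens[start + k]))
        if found <= pairs:
            break  # nothing new was discovered
        pairs |= found
    return pairs, patterns
```

Starting from a single seed, the first round learns one template and finds a second pair; that pair appears with a different verb elsewhere, which gives a second template and, in turn, a third pair. A practical system would score templates and candidates before accepting them, since unfiltered bootstrapping drifts quickly.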

The pseudocode for extracting automobile brands and models in the embodiment of the present invention is as follows:

Step 6: Build the entity relation network without supervision, comprising steps 601 to 604. First, the entities in each sentence are identified, using the annotation results for that sentence. Entity pairs are then built from the identified entities, and relation classification follows.

Step 601: Extract all sentences that contain two or more entities, and extract the verbs in them and the corresponding documents.

Step 602: Normalize and denoise the verbs extracted in step 601: map the verbs to the discrete values 0 and 1, and remove repeated or meaningless verbs.

Step 603: Build a deep-learning-based entity word clustering model on the documents extracted in step 601 to obtain the probability distribution of each entity word over other words.

Step 604: Build the entity relation network from the relations between words, such as subject-predicate and verb-object relations. The network contains the relations between entities described by all of the extracted verbs. The steps for building the network are shown in Fig. 4.
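A minimal sketch of how the network of steps 601 and 604 might be represented: a mapping from entity pairs to the verbs observed between them. The input format (lists of (word, part-of-speech) tuples), the tag "v" for verbs, and the restriction to verbs lying between adjacent entities are assumptions; the source relies on syntactic relations such as subject-predicate and verb-object, which this sketch does not parse.

```python
from collections import defaultdict


def build_relation_network(tagged_sentences, entities):
    """tagged_sentences: list of [(word, pos), ...]; "v" marks a verb (assumed tag)."""
    network = defaultdict(set)  # (entity_a, entity_b) -> verbs seen between them
    for sentence in tagged_sentences:
        positions = [i for i, (word, _) in enumerate(sentence) if word in entities]
        # Sentences with fewer than two entities produce no adjacent pairs.
        for left, right in zip(positions, positions[1:]):
            verbs = {word for word, pos in sentence[left + 1:right] if pos == "v"}
            if verbs:
                network[(sentence[left][0], sentence[right][0])] |= verbs
    return dict(network)
```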

The pseudocode that entity relationship network is built in the embodiment of the present invention is as follows：

The embodiment builds a word2vec model using deep learning, obtains the distribution of words from the model, and computes the similarity between words from those distributions in order to cluster them.
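The similarity-based clustering can be sketched as below. Training word2vec itself is not shown: the word vectors are assumed to be given, for example exported from a trained word2vec model. The source does not name a clustering algorithm, so the greedy single-pass scheme, the cosine measure, and the `min_similarity` threshold are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two dense vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(vectors, min_similarity=0.8):
    """vectors: dict word -> list of floats. Greedy single-pass clustering."""
    clusters = []  # each cluster is a list of words; the first word is its representative
    for word, vec in vectors.items():
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= min_similarity:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters
```

With toy two-dimensional vectors, words pointing in nearly the same direction fall into one cluster and an orthogonal word starts a new one.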

Step 7: Define entity relation categories. The verbs in the articles, such as "purchase", "cooperate", and "release", are extracted as relation categories. Step 7 comprises steps 701 to 702.

Step 701: From the news data preprocessed in step 2, extract the verbs in all sentences that contain two entities.

Step 702: Cluster the verbs to obtain the relation categories; identical verbs are assigned to the same class.

The pseudocode for classifying entity relations in the embodiment is as follows:

Extract articles that contain more than one entity;

Get all verbs between two entities;

Cluster the verbs using the LDA topic clustering model;

Take the verb types as the cluster result, giving the relation types.

Step 8: Classify the entity relations. Each entity pair in the entity relation network is assigned a relation category based on the clustering result of step 7. Each entity pair relation in the network corresponds to a feature; features are extracted and the relation is classified using the rules formed by the clustering in step 7.
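As an illustrative sketch of step 8, the following assigns each entity pair the class of the verbs observed for it. The input shapes, the function name `classify_pairs`, and the majority-vote rule are assumptions; the source says only that classification follows the rules formed by the clustering in step 7.

```python
from collections import Counter


def classify_pairs(network, verb_to_class):
    """network: (entity_a, entity_b) -> iterable of verbs; verb_to_class: verb -> label."""
    labels = {}
    for pair, verbs in network.items():
        votes = Counter(verb_to_class[v] for v in verbs if v in verb_to_class)
        if votes:
            labels[pair] = votes.most_common(1)[0][0]  # majority vote (assumed rule)
    return labels
```

Pairs whose verbs belong to no known class are left unlabeled rather than forced into a category.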

The entity relation network obtained in step 6 yields the set of entities it contains; in the embodiment these are the automobile brand set N and the automobile model set O. For any n ∈ N and o ∈ O, the entity pair (n, o) is built. Because only the relation between brand and model is considered, the brand is always placed first and the model second when a pair is built. The order in which they appear in the sentence is still taken into account as a feature during model learning and classification. For example, in the sentence "Toyota releases the new rav4", if the brand "Toyota" and the model "rav4" are identified, then N = {Toyota} and O = {rav4}, giving the entity pair set {(Toyota, rav4)}.
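The pair construction described above can be sketched as follows. The function name `build_pairs` and the boolean `brand_first` feature are hypothetical names for the ordering feature the text mentions.

```python
from itertools import product


def build_pairs(tokens, brands, models):
    """Return ((brand, model), brand_first) for every brand and model found in tokens."""
    found_brands = [t for t in tokens if t in brands]
    found_models = [t for t in tokens if t in models]
    pairs = []
    for brand, model in product(found_brands, found_models):
        # The pair is always (brand, model); sentence order is kept as a separate feature.
        brand_first = tokens.index(brand) < tokens.index(model)
        pairs.append(((brand, model), brand_first))
    return pairs
```

The pair itself is the same whichever entity comes first in the sentence; only the ordering feature changes.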

Claims

1. A deep-learning-based method for named entity relation extraction and construction, for a specific domain, characterized by comprising the following steps:

Step 1: build a crawler and crawl news data from vertical websites in the domain;

Step 2: preprocess the collected news data to remove junk information, including duplicate information, abnormally displayed information, and garbled-encoding information; the preprocessed news data is used in the subsequent steps;

Step 201: clean the data according to news length, and remove junk news using regular expressions and a predefined rule set;

Step 202: filter the news data with a Bloom filter to remove duplicate news; each news item is first mapped into a bit array using N hash functions; for every subsequent item, its N hash values are computed and checked against the bit array to decide whether the item has already been seen; if all of the positions given by its hash values are already set in the bit array, the item is treated as already present and is filtered out;

Step 3: segment the news data into words, extract keywords, and add the extracted keywords to a dictionary to generate an industry dictionary;

when extracting keywords, segment using N-gram models with N = 1, 2, 3, compute the pointwise mutual information (PMI) value of each word, compare it with the set threshold, and take words whose value exceeds the threshold as keywords;

the PMI value of word a and word b is PMI(a, b) = p(ab) / (p(a) p(b)), where p(a) is the frequency of word a, p(b) is the frequency of word b, and p(ab) is the frequency of the combination ab;

Step 4: segment the news data using the industry dictionary to obtain the corresponding word set;

Step 401: first, call the Chinese word segmentation program to segment the text; then remove stop words according to a stop word list, and normalize the morphology of any English words so that they share a unified form;

Step 402: count the document frequency df and term frequency tf of each word and compute its inverse document frequency idf; compute the weight of each word as log(tf*(idf+1)+1), compare the weight with threshold D, and keep the words whose weight exceeds D to obtain the corresponding word set; the thresholding also reduces the dimension of the word set for the news data;

Step 5: extract a seed dictionary, where each seed is a predefined entity pair; first create a number of seeds manually, then mine entity pairs from the news data by bootstrapping;

Step 6: build the entity relation network without supervision, specifically: extract from the news data the sentences that contain two or more entities, and extract the verbs in those sentences and the corresponding documents; build a deep-learning-based entity word clustering model on the extracted documents to obtain the probability distribution of each entity word over other words; construct the entity relation network from the relations between words described by the verbs;

Step 7: define entity relation categories, specifically: extract from the news data the verbs in sentences that contain two entities, and cluster the verbs, assigning identical verbs to the same class;

Step 8: assign a relation category to each entity pair in the entity relation network based on the clustering result of step 7.