CN108763192A

CN108763192A - Entity relation extraction method and device for text processing

Info

Publication number: CN108763192A
Application number: CN201810348221.6A
Authority: CN
Inventors: 朱耀邦; 高翔; 纪达麒; 陈运文
Original assignee: Information Technology (shanghai) Co Ltd
Current assignee: Daguan Data Co ltd
Priority date: 2018-04-18
Filing date: 2018-04-18
Publication date: 2018-11-06
Anticipated expiration: 2038-04-18
Also published as: CN108763192B

Abstract

This application discloses an entity relation extraction method and device for text processing. The method includes: inputting a text to be processed; identifying the entities in the text to be processed, wherein the text contains multiple entities; screening the entities according to a preset sample to obtain the contextual features of an input instance; using the contextual features to calculate the context similarity between the input instance and each seed sample in a seed sample library; determining whether the context similarity exceeds a first preset threshold; if it does, counting the number of seed samples whose similarity exceeds the first preset threshold; determining whether that number exceeds a second preset threshold; and, if it does, taking the input instance as an entity relation instance obtained by the text processing. The application addresses the technical problem that rule-based methods have high precision but low recall.

Description

Entity relation extraction method and device for text processing

Technical field

This application relates to the technical field of text processing, and in particular to an entity relation extraction method and device for text processing.

Background art

With the rapid development of the Internet, the Internet has become the main channel through which people obtain information, and the volume of text data on the Internet has grown explosively. This text data contains a wealth of information that is very valuable for building knowledge bases and knowledge graphs. However, extracting the relevant knowledge manually is an enormous amount of work, so being able to extract useful information automatically by computer would be of great significance. Almost all text data on the Internet exists as natural language, that is, as unstructured data that a computer cannot process directly.

To solve this problem, information extraction technology emerged. Information extraction obtains structured data from unstructured text, including entities, relations between entities, and events. Relation extraction is a key technique in the field of information extraction: the entities in a text are usually identified first by named entity recognition, and the relations between the entities are then identified by relation extraction. Common relation extraction methods include rule-based methods, unsupervised methods, supervised methods, and semi-supervised methods. Rule-based methods have obvious shortcomings: a large number of rules must be written by hand, which is labor-intensive and hard to maintain, and the rules do not transfer well to other domains. Unsupervised methods, which cluster the text, often perform poorly: both recall and precision are low, and considerable manual intervention is required.

Relation classification based on traditional machine learning algorithms requires a large manually annotated training corpus, which is labor-intensive, and it does not solve the problems of domain portability and of handling new relations. Semi-supervised methods mainly use a small number of annotated instances as an initial seed set and then, through repeated iteration, extract similar instances from unstructured data to extend the seed set. No effective solution to the above problems has yet been proposed.

Summary of the invention

The main purpose of this application is to provide an entity relation extraction method and device for text processing, so as to solve the problem that rule-based methods have high precision but low recall.

To achieve the above purpose, according to one aspect of this application, an entity relation extraction method for text processing is provided.

The entity relation extraction method for text processing according to this application includes: inputting a text to be processed; identifying the entities in the text to be processed, wherein the text contains multiple entities; screening the entities according to a preset sample to obtain the contextual features of an input instance; using the contextual features to calculate the context similarity between the input instance and each seed sample; determining whether the context similarity exceeds a first preset threshold; if it does, counting the number of seed samples whose similarity exceeds the first preset threshold; determining whether that number exceeds a second preset threshold; and, if it does, taking the input instance as an entity relation instance obtained by the text processing.

Further, before the entity relation extraction method starts, the method includes training a word vector model, specifically: training on a background corpus with the gensim tool to obtain the word vector model.

Further, identifying the entities in the text to be processed includes: obtaining the entities in the text to be processed using a named entity recognition method. Further, screening the entities according to a preset sample to obtain the contextual features of the input instance includes: segmenting the text to be processed into words; performing part-of-speech tagging on the segmentation result; filtering the part-of-speech tagging result to obtain candidate words; using a context window to obtain the target words among the candidate words; and forming the contextual features of the input instance from the target words.

Further, using the contextual features to calculate the context similarity between the input instance and the seed samples includes: substituting the contextual features into a preset formula to obtain the context similarity. The preset formula is:

where similarity denotes the context similarity.

To achieve the above purpose, according to another aspect of this application, an entity relation extraction device for text processing is provided.

The entity relation extraction device for text processing according to this application includes: an input module, which inputs a text to be processed; an identification module, which identifies the entities in the text to be processed, wherein the text contains multiple entities, and constructs an input instance (entity 1, entity 2, input text); a screening module, which screens the entities according to a preset sample to obtain the contextual features of the input instance; a computing module, which uses the contextual features to calculate the context similarity between the input instance and each seed sample; a first judgment module, which determines whether the context similarity exceeds a first preset threshold; a statistics module, which, if the similarity exceeds the first preset threshold, counts the number of seed samples whose similarity exceeds the first preset threshold; a second judgment module, which determines whether that number exceeds a second preset threshold; and a termination module, which, if that number exceeds the second preset threshold, takes the input instance as an entity relation instance obtained by the text processing.

Further, the entity relation extraction device also includes a training module for training a word vector model, specifically: training on a background corpus with the gensim tool to obtain the word vector model.

Further, the identification module includes an entity acquisition module, which obtains the entities in the text to be processed using a named entity recognition method.

Further, the screening module includes: a word segmentation module for segmenting the text to be processed into words; a tagging module, which performs part-of-speech tagging on the segmentation result; a filtering module, which filters the part-of-speech tagging result to obtain candidate words; a target word acquisition module for using a context window to obtain the target words among the candidate words; and a contextual feature generation module for forming the contextual features of the input instance from the target words.

Further, the computing module includes a substitution module for substituting the contextual features into the preset formula to obtain the context similarity.

In the embodiments of this application, a word vector model is combined with context similarity: the similarity between the input instance and the seed samples is calculated and compared with preset thresholds to obtain the samples that meet the target. This achieves entity relation extraction, improves the recall of relation extraction, and thereby solves the technical problem that rule-based methods have high precision but low recall.

Brief description of the drawings

The accompanying drawings, which form part of this application, are provided for a further understanding of the application, so that its other features, objects, and advantages become more apparent. The drawings of the illustrative embodiments and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of the entity relation extraction method for text processing according to an embodiment of this application;

Fig. 2 is a schematic diagram of contextual feature generation according to an embodiment of this application;

Fig. 3 is a schematic diagram of the entity relation extraction device for text processing according to an embodiment of this application;

Fig. 4 is a schematic diagram of the screening module according to an embodiment of this application; and

Fig. 5 is an operational flowchart of the method according to an embodiment of this application.

Detailed description

To help those skilled in the art better understand the solution of this application, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the scope of protection of this application.

It should be noted that the terms "include" and "have", and any variations of them, in the description, the claims, and the drawings of this application are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

It should also be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

As shown in Fig. 1, this application relates to an entity relation extraction method for text processing. The method includes the following steps S101 to S108:

Step S101: input the text to be processed.

The text to be processed is unstructured text from which structured data is to be extracted, including but not limited to entities, relations between entities, and events.

Step S102: identify the entities in the text to be processed, wherein the text contains multiple entities.

The entities in the text to be processed are obtained using a named entity recognition method.

Step S103: screen the entities according to a preset sample to obtain the contextual features of an input instance.

As a preferred option in this embodiment, and as shown in Fig. 2, step S103 includes the following steps S201 to S205:

Step S201: segment the text to be processed into words.

Step S202: perform part-of-speech tagging on the segmentation result.

Preferably, each word is tagged as a noun, verb, adverb, and so on.

Step S203: filter the part-of-speech tagging result to obtain candidate words.

Preferably, only verbs and nouns are retained as candidate words.

Step S204: use a context window to obtain the target words among the candidate words.

Preferably, the context [left1, right1, left2, right2] is obtained according to the context window (a, b, c, d), where left1, right1, left2, and right2 are, respectively, the a words to the left of entity 1, the b words to the right of entity 1, the c words to the left of entity 2, and the d words to the right of entity 2. If fewer words are actually available than the window size, all of the available words are taken.

Step S205: form the contextual features of the input instance from the target words.
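Steps S203 to S205 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the `(word, tag)` token format, the tag names `"n"` and `"v"`, the function name `context_windows`, and the choice to draw both right1 and left2 from the words lying between the two entities are all illustrative and are not specified in the text.

```python
# Assumed tag names for noun and verb; real taggers use their own tag sets.
KEEP_TAGS = {"n", "v"}

def context_windows(tokens, e1, e2, window):
    """Return [left1, right1, left2, right2] for one input instance.

    tokens: list of (word, tag) pairs for one sentence.
    e1, e2: indices of entity 1 and entity 2 in tokens, with e1 < e2.
    window: the tuple (a, b, c, d) of window sizes.
    """
    a, b, c, d = window

    def keep(lo, hi):
        # S203: retain only nouns and verbs as candidate words.
        return [w for w, t in tokens[lo:hi] if t in KEEP_TAGS]

    before1 = keep(0, e1)
    between = keep(e1 + 1, e2)   # assumption: shared by right1 and left2
    after2 = keep(e2 + 1, len(tokens))

    # S204: slicing takes all available words when fewer than the window size.
    left1 = before1[-a:] if a > 0 else []
    right1 = between[:b]
    left2 = between[-c:] if c > 0 else []
    right2 = after2[:d]
    # S205: the four word lists together form the contextual feature.
    return [left1, right1, left2, right2]

tokens = [("yesterday", "t"), ("Acme", "ENT"), ("officially", "d"),
          ("acquired", "v"), ("startup", "n"), ("Beta", "ENT"),
          ("for", "p"), ("cash", "n")]
print(context_windows(tokens, 1, 5, (1, 1, 1, 1)))
```

With a window of (1, 1, 1, 1), left1 is empty here because the only word before entity 1 is neither a noun nor a verb.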

Step S104: use the contextual features to calculate the context similarity between the input instance and each seed sample.

Preferably, the context similarity is obtained by substituting the contextual features into a preset formula. The preset formula is:

where similarity denotes the context similarity.

Step S105: determine whether the context similarity exceeds the first preset threshold.

Step S106: if the context similarity exceeds the first preset threshold, count the number of seed samples whose similarity exceeds the first preset threshold.

Step S107: determine whether the number of seed samples whose similarity exceeds the first preset threshold exceeds the second preset threshold.

Step S108: if that number exceeds the second preset threshold, take the input instance as an entity relation instance obtained by the text processing.
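The two-threshold decision in steps S105 to S108 can be sketched in a few lines of Python. The function and parameter names are hypothetical, and the strict "greater than" comparisons follow the wording of the steps.

```python
def accept_instance(similarities, first_threshold, second_threshold):
    """Decide whether an input instance is kept as an entity relation instance.

    similarities: context similarity of the input instance to each seed sample.
    first_threshold: minimum similarity for a seed sample to count (S105/S106).
    second_threshold: minimum number of such seed samples (S107/S108).
    """
    # S105/S106: count the seed samples whose similarity exceeds the first threshold.
    matching = sum(1 for s in similarities if s > first_threshold)
    # S107/S108: accept only if that count exceeds the second threshold.
    return matching > second_threshold

print(accept_instance([0.9, 0.8, 0.2], first_threshold=0.7, second_threshold=1))
```

Requiring agreement with several seed samples, rather than a single close match, is what keeps one accidental context overlap from admitting a wrong instance.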

It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

According to an embodiment of this application, a device for implementing the above method is also provided. As shown in Fig. 3, the device includes: an input module 10, which inputs a text to be processed; an identification module 20, which identifies the entities in the text to be processed, wherein the text contains multiple entities; a screening module 30, which screens the entities according to a preset sample to obtain the contextual features of an input instance; a computing module 40, which uses the contextual features to calculate the context similarity between the input instance and each seed sample; a first judgment module 50, which determines whether the context similarity exceeds a first preset threshold; a statistics module 60, which, if the similarity exceeds the first preset threshold, counts the number of seed samples whose similarity exceeds the first preset threshold; a second judgment module 70, which determines whether that number exceeds a second preset threshold; and a termination module 80, which, if that number exceeds the second preset threshold, takes the input instance as an entity relation instance obtained by the text processing.

As shown in Fig. 4, the screening module 30 includes: a word segmentation module 301 for segmenting the text to be processed into words; a tagging module 302, which performs part-of-speech tagging on the segmentation result; a filtering module 303, which filters the part-of-speech tagging result to obtain candidate words; a target word acquisition module 304 for using a context window to obtain the target words among the candidate words; and a contextual feature generation module 305 for forming the contextual features of the input instance from the target words.

As shown in Fig. 5, the operational flow of the method is as follows.

Seed sample generation. Rule templates are written from domain knowledge to identify the specified entity relation. The templates should be as strict as possible to ensure high precision, and they should also cover as many ways of expressing the relation as possible. After the rules have identified candidate seed samples, the candidates are filtered manually to remove incorrect samples, which gives the final seed samples.
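As an illustration of a strict rule template, the following Python sketch uses a single regular expression for a hypothetical "acquisition" relation. The pattern, the relation, and the names `ACQUIRE_RULE` and `candidate_seeds` are invented for the example; the text does not give any concrete templates, and the output would still need the manual filtering described above.

```python
import re

# Hypothetical, deliberately strict template: two capitalized names
# joined by the verb "acquired" (optionally "has acquired").
ACQUIRE_RULE = re.compile(r"(?P<e1>[A-Z]\w+) (?:has )?acquired (?P<e2>[A-Z]\w+)")

def candidate_seeds(sentences):
    """Return (entity 1, entity 2, text) triples matched by the template."""
    seeds = []
    for text in sentences:
        match = ACQUIRE_RULE.search(text)
        if match:
            seeds.append((match.group("e1"), match.group("e2"), text))
    return seeds

print(candidate_seeds(["Acme acquired Beta last year.", "Acme praised Beta."]))
```

A template this narrow misses paraphrases such as "Beta was bought by Acme", which is the low-recall behavior that the later similarity-based expansion is meant to compensate for.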

Word vector model training. The word vector approach was proposed by Hinton in 1986. It represents a word as a low-dimensional real-valued vector, for example [0.179, -0.157, -0.117, 0.909, -0.532, ...]; this form is the word vector. In the word vector space, two points whose vectors form a small angle represent words that are semantically similar or related. Word vectors produced by a better training algorithm reflect the semantic similarity between words more faithfully.

The similarity between word X and word Y is calculated as the cosine of the angle between their vectors:

$$\mathrm{similarity}(X, Y) = \cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$
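The cosine measure between two word vectors can be computed directly. The sketch below uses plain Python lists; returning 0.0 for a zero-length vector is a convention chosen for this example and is not taken from the text.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_x * norm_y)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
```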

This embodiment trains the word vectors with the gensim tool. The corpus used is a news corpus covering all domains, and the vectors have 128 dimensions.

Sample contextual feature generation. A sample is a triple (entity 1, entity 2, text content). For a given sample, the text content is segmented into words, tagged with parts of speech, and processed with named entity recognition, which gives a result of the form [w_0/tag_0, w_1/tag_1, ..., w_{i-1}/tag_{i-1}, entity 1, w_{i+1}/tag_{i+1}, ..., w_{j-1}/tag_{j-1}, entity 2, w_{j+1}/tag_{j+1}, ..., w_k/tag_k]. Part-of-speech filtering then retains only verbs and nouns. The context [left1, right1, left2, right2] is obtained according to the context window (a, b, c, d), where left1, right1, left2, and right2 are, respectively, the a words to the left of entity 1, the b words to the right of entity 1, the c words to the left of entity 2, and the d words to the right of entity 2. If fewer words are actually available than the window size, all of the available words are taken. Finally, the trained word vector model is used to obtain the vector representation of the contextual features, [[v_{j-a}, ..., v_{j-1}], [v_{j+1}, ..., v_{j+b}], [v_{k-c}, ..., v_{k-1}], [v_{k+1}, ..., v_{k+d}]], where in this expression j and k appear to denote the positions of entity 1 and entity 2.
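The last step, turning the four context word lists into vectors, can be sketched as a lookup. In this sketch a plain dictionary stands in for the trained word vector model, and mapping unknown words to a zero vector is an assumption of the example; the text does not say how out-of-vocabulary words are handled.

```python
def vectorize_context(context, vectors, dim):
    """Replace each context word by its vector.

    context: [left1, right1, left2, right2], each a list of words.
    vectors: dict mapping word -> list of floats (stand-in for the model).
    dim: vector dimension, used for the assumed zero vector of unknown words.
    """
    zero = [0.0] * dim
    return [[vectors.get(word, zero) for word in window] for window in context]

toy_vectors = {"acquired": [1.0, 0.0], "cash": [0.0, 1.0]}  # toy 2-dimensional vectors
context = [[], ["acquired", "startup"], ["acquired"], ["cash"]]
print(vectorize_context(context, toy_vectors, dim=2))
```

Keeping a placeholder for unknown words preserves each word's position in its window, which matters if positions are later weighted.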

Sample similarity calculation. Contextual features are generated for the candidate sample, and its similarity to each seed sample is calculated in turn. For the input candidate sample features [[w_{j-a}, ..., w_{j-1}], [w_{j+1}, ..., w_{j+b}], [w_{k-c}, ..., w_{k-1}], [w_{k+1}, ..., w_{k+d}]], the seed sample features [[v_{j-a}, ..., v_{j-1}], [v_{j+1}, ..., v_{j+b}], [v_{k-c}, ..., v_{k-1}], [v_{k+1}, ..., v_{k+d}]], and the weight vectors [[f_1, ..., f_a], [f_{a+1}, ..., f_{a+b}], [f_{a+b+1}, ..., f_{a+b+c}], [f_{a+b+c+1}, ..., f_{a+b+c+d}]], the similarity is calculated as follows:

The actual lengths of the two feature vector windows are not necessarily the same. The numerator is computed over the part the two windows have in common, and the denominator uses the actual size of the seed sample's feature vector window.

It should be pointed out that this similarity is not symmetric: it is the similarity of the candidate sample relative to the seed sample.
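The similarity formula itself is not reproduced in this text, so the following Python sketch is only one possible reading of the surrounding description, not the patented formula. It assumes that the numerator is a weighted sum of cosine similarities over the positions both windows share, that the denominator is the sum of the weights over the seed sample's actual window sizes, and that positions are aligned from the start of each window. All names are hypothetical.

```python
import math

def _cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

def context_similarity(candidate, seed, weights):
    """Similarity of a candidate relative to a seed sample (assumed form).

    candidate, seed: four windows each, every window a list of word vectors.
    weights: four lists of position weights, one per window.
    """
    numerator = 0.0
    denominator = 0.0
    for cand_win, seed_win, w_win in zip(candidate, seed, weights):
        common = min(len(cand_win), len(seed_win))
        # Numerator: only the positions present in both windows.
        numerator += sum(w_win[i] * _cosine(cand_win[i], seed_win[i])
                         for i in range(common))
        # Denominator: the seed window's actual size.
        denominator += sum(w_win[:len(seed_win)])
    return numerator / denominator if denominator else 0.0

weights = [[1.0, 1.0]] * 4
short = [[[1.0, 0.0]], [], [], []]              # one context word
long_ = [[[1.0, 0.0], [0.0, 1.0]], [], [], []]  # two context words
print(context_similarity(short, long_, weights))
print(context_similarity(long_, short, weights))
```

Under these assumptions the two calls give different values, which matches the asymmetry noted above: a candidate that covers only part of a seed's context is penalized, while a candidate that covers all of a shorter seed's context is not.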

Seed sample expansion. For the input corpus, every document is traversed and split into sentences at sentence-ending punctuation (full stop, question mark, and so on).
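Sentence splitting of this kind can be sketched with a regular expression. The set of sentence-ending marks below (Chinese full stop, question mark, and exclamation mark, plus the ASCII question and exclamation marks) is an assumption; the text names only the full stop and the question mark "and so on". The ASCII period is left out of this sketch so that decimals such as "3.5" are not split.

```python
import re

# Assumed sentence-ending punctuation; extend as needed for a given corpus.
_SENTENCE = re.compile(r"[^。？！?!]+[。？！?!]*")

def split_sentences(document):
    """Split a document into sentences, keeping the closing punctuation."""
    return [s.strip() for s in _SENTENCE.findall(document) if s.strip()]

print(split_sentences("今天下雨。你来吗？好"))
```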

For each sentence, named entity recognition is performed first. If the sentence contains two entities of the specified types, a candidate sample (entity 1, entity 2, text content) is constructed; otherwise, processing moves on to the next sentence.

The contextual features of the candidate sample are then constructed, the similarity between the candidate sample and each sample in the seed sample library is calculated, and the number of samples whose similarity exceeds a given threshold is counted. If that number exceeds a given threshold (for example, 10% of the current number of seed samples), the candidate sample is added to the sample library; otherwise, processing moves on to the next candidate.
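The expansion loop can be sketched as follows. The similarity function is passed in as a parameter so that the sketch stays independent of how similarity is computed; the function name, the parameter names, and the decision to let newly added samples count as seeds for later candidates are assumptions, while the 10% ratio comes from the example in the text.

```python
def expand_seed_library(candidates, seeds, similarity, sim_threshold, ratio=0.10):
    """Grow the seed sample library with candidates that resemble enough seeds.

    candidates: iterable of candidate samples, in processing order.
    seeds: initial seed samples.
    similarity: function (candidate, seed) -> float.
    sim_threshold: minimum similarity for a seed to count as a match.
    ratio: required share of the current library, e.g. 0.10 for 10%.
    """
    library = list(seeds)
    for candidate in candidates:
        hits = sum(1 for seed in library
                   if similarity(candidate, seed) > sim_threshold)
        # The count threshold tracks the current size of the library.
        if hits > ratio * len(library):
            library.append(candidate)
    return library

# Toy usage: samples are numbers and similarity is closeness on the number line.
toy_similarity = lambda c, s: 1.0 - abs(c - s)
print(expand_seed_library([1.05, 9.0], [1.0, 1.1, 5.0], toy_similarity, 0.8))
```

Tying the count threshold to the library size means the bar for admission rises as the library grows, which limits drift as more samples are added.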

As can be seen from the above description, this application achieves the following technical effects. Because entity pairs with the same entity relation tend to have similar contexts, expanding the sample library on the basis of sample context similarity effectively improves the recall of relation extraction. The word vector model is trained on a large general-purpose corpus, and computing context similarity from word vectors significantly improves generalization.

Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, each of them can be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The application is therefore not limited to any specific combination of hardware and software.

The above are only preferred embodiments of this application and are not intended to limit it. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. An entity relation extraction method for text processing, characterized by comprising:

inputting a text to be processed;

identifying the entities in the text to be processed, wherein the text to be processed contains multiple entities;

screening the entities according to a preset sample to obtain the contextual features of an input instance;

calculating, from the contextual features, the context similarity between the input instance and each seed sample in a seed sample library;

determining whether the context similarity exceeds a first preset threshold;

if the context similarity exceeds the first preset threshold, counting the number of seed samples whose similarity exceeds the first preset threshold;

determining whether the number of seed samples whose similarity exceeds the first preset threshold exceeds a second preset threshold;

if that number exceeds the second preset threshold, taking the input instance as an entity relation instance obtained by the text processing.

2. The entity relation extraction method according to claim 1, characterized in that, before the entity relation extraction method starts, the method comprises:

training a word vector model, specifically: training on a background corpus with the gensim tool to obtain the word vector model.

3. The entity relation extraction method according to claim 1, characterized in that identifying the entities in the text to be processed comprises:

obtaining the entities in the text to be processed using a named entity recognition method.

4. The entity relation extraction method according to claim 1, characterized in that screening the entities according to a preset sample to obtain the contextual features of the input instance comprises:

segmenting the text to be processed into words;

performing part-of-speech tagging on the segmentation result;

filtering the part-of-speech tagging result to obtain candidate words;

using a context window to obtain the target words among the candidate words;

forming the contextual features of the input instance from the target words.

5. The entity relation extraction method according to claim 1, characterized in that calculating, from the contextual features, the context similarity between the input instance and each seed sample comprises:

substituting the contextual features into a preset formula to obtain the context similarity;

the preset formula being:

where similarity denotes the context similarity.

6. An entity relation extraction device for text processing, characterized by comprising:

an input module, which inputs a text to be processed;

an identification module, which identifies the entities in the text to be processed, wherein the text to be processed contains multiple entities;

a screening module, which screens the entities according to a preset sample to obtain the contextual features of an input instance;

a computing module, which calculates, from the contextual features, the context similarity between the input instance and each seed sample;

a first judgment module, which determines whether the context similarity exceeds a first preset threshold;

a statistics module, which, if the similarity exceeds the first preset threshold, counts the number of seed samples whose similarity exceeds the first preset threshold;

a second judgment module, which determines whether the number of seed samples whose similarity exceeds the first preset threshold exceeds a second preset threshold;

a termination module, which, if that number exceeds the second preset threshold, takes the input instance as an entity relation instance obtained by the text processing.

7. The entity relation extraction device according to claim 6, characterized in that the entity relation extraction device further comprises: a training module for training a word vector model, specifically: training on a background corpus with the gensim tool to obtain the word vector model.

8. The entity relation extraction device according to claim 6, characterized in that the identification module comprises:

an entity acquisition module, which obtains the entities in the text to be processed using a named entity recognition method.

9. The entity relation extraction device according to claim 6, characterized in that the screening module comprises:

a word segmentation module for segmenting the text to be processed into words;

a tagging module, which performs part-of-speech tagging on the segmentation result;

a filtering module, which filters the part-of-speech tagging result to obtain candidate words;

a target word acquisition module for using a context window to obtain the target words among the candidate words;

a contextual feature generation module for forming the contextual features of the input instance from the target words.

10. The entity relation extraction device according to claim 6, characterized in that the computing module comprises:

a substitution module for substituting the contextual features into the preset formula to obtain the context similarity.